WestGrid - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Introduction

WestGrid operates high performance computing (HPC), collaboration and visualization infrastructure across western Canada. It encompasses 14 partner institutions across four provinces.
An extensive overview of WestGrid can be found at the WestGrid website. You can also read the QuickStart Guide for New Users at http://www.westgrid.ca/support/quickstart/new_users

How to get a WestGrid account?

  1. Lead researcher: You will select this option if you are the leader of a project (research group). You will be asked to enter some information about the nature of your research project before you apply for your user account. You will be given a Project ID Number that other collaborators in your group may cite when applying for their user accounts. Note that only faculty members can be project leaders.
  2. Join an existing project (research group): This option is for researchers who are supervised by a lead researcher (for example, a student who is working for a professor). To join a pre-existing group, you will need to obtain the Project ID Number from the project's leader and enter the number on the form. The project leader will be asked to verify the information you submit. You can look up project ID numbers using the web page https://rsg.nic.ualberta.ca/project_lookup.php.

To apply for an account, proceed to the Account Request page https://rsg.nic.ualberta.ca/.

What will you get in the next few days?

After you submit your application, you will get a few e-mails from WestGrid.

  1. WestGrid Account Application Received: WestGrid Account Management received your application.
  2. Asking Permission from Project Leader: If you applied to join an existing project (usually the case for students), WestGrid will send an e-mail to the project leader asking for confirmation.
  3. WestGrid Application Accepted: Your application for a WestGrid account has been approved.
  4. WestGrid account created: WestGrid has set up an account for you on silo.westgrid.ca and hopper.westgrid.ca. These are storage servers for medium- and long-term data storage. For more information about using Silo and Hopper, please visit http://westgrid.ca/support/quickstart/silo. Note: The shell on Silo is restricted; it can only be used for managing and downloading files. You cannot run programs or scripts on Silo.
  5. Welcome to _cluster name_: Your account on the named cluster has been activated. In my case, the cluster is glacier.westgrid.ca. Note: the storage servers and the clusters have separate file systems; you cannot directly access files stored on a storage server from a cluster. You will need to use gcp to copy files between different WestGrid machines.

How to transfer my files to/between WestGrid?

Assume my host machine is okanagan.cs.ubc.ca, my WestGrid storage server is silo.westgrid.ca, and my cluster in WestGrid is glacier.westgrid.ca. The file I want to transfer is test.txt.

  • Transfer files between WestGrid machines (from glacier.westgrid.ca to silo.westgrid.ca)
        gcp test.txt username@silo.westgrid.ca:~/
       
  • Transfer files from your local machine to WestGrid (from okanagan.cs.ubc.ca to glacier.westgrid.ca)
        okanagan:> scp test.txt username@glacier.westgrid.ca:~/
        username@glacier.westgrid.ca's password: password
       
  • If you want to write a script to transfer many files from your local machine to WestGrid, entering a password for each transfer becomes a problem. Here is the solution: first log in on okanagan.cs.ubc.ca as user username and generate a pair of authentication keys. Do not enter a passphrase:
        okanagan:~> ssh-keygen -t rsa
        Generating public/private rsa key pair.
        Enter file in which to save the key (/ubc/cs/home/username/.ssh/id_rsa): 
        Enter passphrase (empty for no passphrase): 
        Enter same passphrase again:  
        Your identification has been saved in /ubc/cs/home/username/.ssh/id_rsa.
        Your public key has been saved in /ubc/cs/home/username/.ssh/id_rsa.pub.
        The key fingerprint is:
        0e:97:88:0f:86:70:39:8f:44:13:e3:f4:5f:79:32:cd username@okanagan
        
    Go to ~/.ssh and transfer id_rsa.pub to glacier.westgrid.ca, into the .ssh directory in your home directory.
        scp id_rsa.pub username@glacier.westgrid.ca:~/.ssh/
        
    On glacier, append id_rsa.pub to authorized_keys2 (in ~/.ssh):
        cat id_rsa.pub >> authorized_keys2 
        
    Now, try to use scp to transfer files (no password required).
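With the key in place, a batch transfer needs no interaction at all. Below is a minimal sketch (the file names are hypothetical, and the leading echo makes it a dry run that only prints the commands; remove the echo to actually copy):

```shell
#!/bin/sh
# Copy a list of files to glacier without password prompts.
# With the public key from the steps above installed, each scp
# would run unattended (e.g. from cron or an experiment script).
DEST="username@glacier.westgrid.ca:~/"

transfer() {
    # Prints the scp command for one file; remove the leading 'echo'
    # to perform the actual transfer.
    echo scp "$1" "$DEST"
}

for f in test1.txt test2.txt test3.txt; do
    transfer "$f"
done
```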

Running Jobs

A great majority of the computational work on WestGrid systems is carried out through non-interactive batch processing. Job scripts containing commands to be executed are submitted from a login server to a batch job handling system, which queues the requests, allocates processors and starts and manages the jobs. The system software that handles your batch jobs consists of two pieces: a resource manager (TORQUE) and a scheduler (Moab). This system is fairly similar to our SunGridEngine. For detailed information, please visit http://westgrid.ca/support/running_jobs.

A batch job script is a text file of commands for the UNIX shell to interpret, similar to what you could execute by typing directly at a keyboard. The job is submitted to a queue using the qsub command. A job will wait in the queue for a time that depends on factors such as system load and the priority assigned to the job. When appropriate resources become available to run a job, it is started on one or more assigned processors. A job will be terminated if it exceeds its allotted time limit or, on some systems, if it exceeds memory limits. By default, the standard output and error streams from the job are directed to files in the directory from which the job was submitted. For detailed information on how to write a job script, please visit http://westgrid.ca/support/running_jobs#directives
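As a concrete illustration, a minimal TORQUE job script might look like the following. The resource values and the program name are hypothetical placeholders, not WestGrid-mandated settings; adjust them to your job:

```shell
#!/bin/bash
# Minimal TORQUE job script sketch.
# Directives: job name, max run time, one processor on one node, memory limit.
#PBS -N test_job
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb

# TORQUE starts the job in your home directory; PBS_O_WORKDIR (set by
# TORQUE) holds the directory the job was submitted from.
cd "${PBS_O_WORKDIR:-.}"

echo "Job running on $(hostname)"
# ./myprogram input.txt    # replace with your own program
```

Submit it with qsub; as noted above, the standard output and error streams end up in files (named after the job name and job ID) in the directory from which the job was submitted.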

A few useful commands:

  • qstat: Check the status of your jobs and the queue
  • qsub: Submit a job to the queue (you can also submit an array job, e.g. qsub -t 1-100)
  • qdel: Delete your own jobs if something goes wrong
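For the qsub -t array form, TORQUE runs the same script once per index and exposes the index in the PBS_ARRAYID environment variable (newer schedulers may use a different variable name; check your system's qsub documentation). A sketch with hypothetical file and program names:

```shell
#!/bin/bash
# Array job sketch for 'qsub -t 1-100': each of the 100 sub-jobs runs this
# same script with PBS_ARRAYID set to its own index (1..100).
#PBS -N array_test
#PBS -l walltime=00:30:00
#PBS -l nodes=1:ppn=1

cd "${PBS_O_WORKDIR:-.}"

# Each sub-job picks its own input file by index, e.g. input_37.txt.
IN="input_${PBS_ARRAYID}.txt"
OUT="output_${PBS_ARRAYID}.txt"
echo "sub-job ${PBS_ARRAYID}: ${IN} -> ${OUT}"
# ./myprogram "$IN" > "$OUT"    # replace with your own program
```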

A few notes if you are using glacier.westgrid.ca (http://guide.westgrid.ca/guide-pages/jobs.html):

Scheduler (Fairshare & RAC)

The WestGrid job scheduler is a priority queue with a backfill mechanism. The scheduler will dispatch the highest-priority job in the queue if there are sufficient resources for it to run. If there are insufficient resources to dispatch the highest-priority job, the scheduler will find the next highest-priority job whose execution will not overlap with the approximate* earliest start of the original job (* since jobs can finish before their time cutoff, the scheduler uses an upper bound on the earliest start time for a job). A job's priority is a weighted sum of several components. The most important components (by weight) are requested resources and fairshare.

Resource Usage

At the moment the only resource request that affects your job's priority is the number of processors you request. The amount of time or memory you request has no impact on your job's priority, though memory-intensive runs are harder to dispatch regardless of priority.

Somewhat counter-intuitively (at first), asking for more processors will increase your job's priority. This is done to improve overall cluster performance: multi-node jobs are far less likely to be dispatched by the backfill mechanism, so they must be given a higher priority to compensate. The current contribution to your job's priority on Glacier and Orcinus is 100 * <# of requested processors>.

Quotas

Your disk quotas are based on the number of files, and not just the amount of disk space you use.

To check your quota on glacier, execute the command:

    /usr/lpp/mmfs/bin/mmlsquota

In the output, gpfs1 is glacier, gpfs2 is scratch, and gpfs3 is orcinus.

A handy command line to see the number of files in your directories, recursively, is:

    find . -type d -exec sh -c "echo -n {}; ls {} | wc" \;

Fairshare (& RAC)

A user's (or account's) fairshare value is a weighted average of cluster usage over a set of disjoint time windows. For example, Orcinus and Glacier use 7 time windows of 36 hours each, with the following weights:

    window  w1    w2    w3    w4    w5    w6    w7
    weight  1.0   0.9   0.81  0.73  0.66  0.59  0.53

Note that these are not sliding windows: there is a set time at which the current window ends and the windows are rolled over. So, for example, if 30 hours into the current window some user has 10% cluster usage for w1 (the current window), 30% usage for windows w2 and w3, and 0% for all others, their current fairshare value will be 0.108. If they stop using the cluster at this point, their fairshare value will reset to 0 after 252+6 hours.

One important note: a user's/account's cluster usage in a time window is a percentage of the total usage of the cluster in that window, NOT a percentage of the available resources. So if the cluster is used by only one user in a time window, they will be treated as having 100% usage for that window regardless of how many nodes they actually use.

The fairshare value is used in conjunction with a user's/account's fairshare target. Without an RAC, this is set to about 1-2% of a cluster. With an RAC, it is set to whatever was awarded (e.g., a 300-node RAC on Orcinus would give a target of ~10%). The fairshare component of the priority is then:

  • Without an RAC:
          (FS Weight) 
              * (FS User Weight) * ((FS User Target) - (FS User Value))
              * (FS Account Weight) * ((FS Account Target) - (FS Account Value))
  • With an RAC:
          (FS Weight) 
              * (FS User Weight) * ((FS User Target) - (FS User Value)) 
              * (FS Account Weight) * MAX(0, (FS Account Target) - (FS Account Value))

(The difference: accounts with an RAC are not penalized for going over their target.)

On Orcinus and Glacier:

  • FS Weight = 100
  • FS User Weight = 50
  • FS Account Weight = 100

The last important note is that fairshare values are specific to individual clusters on WestGrid; i.e. using Orcinus heavily will not affect your (or your group's) priority on Glacier or Lattice.

Topic revision: r3 - 2011-03-23 - DaveTompkins
 