We encourage all students to use the Conda package manager for easier environment management. The Conda installer can be downloaded from https://www.anaconda.com/products/individual.
The default environment usually has many pre-installed packages, which may conflict with the packages we want to install. Therefore, we create a brand new environment with the following command:
conda create -n cs533r python=3.8
This command creates a Python 3.8 environment named cs533r.
Then we can use the command
conda activate cs533r
to activate the cs533r environment.
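Putting these steps together, a minimal sketch of the environment setup (assuming conda is already on your PATH) looks like this; conda env list and conda deactivate are standard conda subcommands for listing environments and leaving the active one:

```shell
# Create the environment (only needed once).
conda create -n cs533r python=3.8

# Activate it; the shell prompt should now be prefixed with (cs533r).
conda activate cs533r

# Optional sanity checks: the environment should appear in the list,
# and python should resolve to the environment's interpreter.
conda env list
python --version   # should report Python 3.8.x

# Leave the environment when you are done.
conda deactivate
```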
We have included all packages needed in requirements.txt. You can use
conda install --file requirements.txt
to automatically install all required packages.
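If conda cannot resolve some entries in requirements.txt, a common fallback (an assumption here, not part of the official setup) is to install the remaining packages with pip inside the activated environment, since pip reads the same one-package-per-line format:

```shell
# Install into the cs533r environment, not the system Python.
conda activate cs533r
pip install -r requirements.txt
```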
As our assignments are written in Jupyter notebooks, we use
jupyter lab
to open JupyterLab in the browser.
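When JupyterLab runs on a remote machine with no display, one common pattern (sketched here with an assumed port 8888 and the placeholder account xxx) is to start the server without a browser and tunnel the port over SSH:

```shell
# On the remote machine: start JupyterLab without opening a browser.
jupyter lab --no-browser --port=8888

# On your local machine: forward local port 8888 to the remote server,
# then open http://localhost:8888 in your browser.
ssh -L 8888:localhost:8888 xxx@remote.student.cs.ubc.ca
```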
This machine is used to submit heavy jobs. In general, a submitted job may wait in the queue for some time before it is accepted to run. Therefore, running Jupyter on this machine is strongly discouraged.
To access xxx@lin*.student.cs.ubc.ca, we first need to log into the intermediate host with
ssh xxx@remote.student.cs.ubc.ca
Then we can access the server by
ssh xxx@lin*.student.cs.ubc.ca
where * can be any number from 03 to 25.
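Instead of typing two ssh commands every time, the hop can be automated with a ProxyJump entry in ~/.ssh/config (a standard OpenSSH feature; the host aliases cs-remote and cs-lin and the choice of lin12 are arbitrary examples):

```
# ~/.ssh/config
Host cs-remote
    HostName remote.student.cs.ubc.ca
    User xxx

Host cs-lin
    HostName lin12.student.cs.ubc.ca
    User xxx
    ProxyJump cs-remote
```

With this in place, ssh cs-lin connects through the intermediate host in one step.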
Our working directory is /hpc/cs-533r/students. We can use
cd /hpc/cs-533r/students
to access it. We need to create a folder under this directory as our own working folder. All packages and data should go in this folder. Please name the folder after your group for convenience. We take xxx as an example.
mkdir xxx
cd xxx
The conda environment can be set up on the server by following the instructions above.
Let's start from the hello world script hello.sh
#!/bin/bash
#SBATCH --time=00:01:00 #### time limit in HH:MM:SS (or D-HH:MM:SS)
#SBATCH --account=hpc-cpsc533r #### account should not be changed
#SBATCH --partition=CPSC533R #### partition should not be changed
echo 'Hello, world!'
sleep 30
After saving this to hello.sh, we can use
sbatch hello.sh
to submit the job.
You will be able to see your job running (as it sleeps for 30 seconds) by running the command:
squeue
To cancel this job before it finishes, you could get job_id from the output of squeue and run the command:
scancel job_id
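sbatch prints a line of the form "Submitted batch job <id>", so the job id can also be captured in a script; the awk field index below assumes that exact output format:

```shell
# Submit and capture the numeric job id (4th word of sbatch's output).
jobid=$(sbatch hello.sh | awk '{print $4}')

# Show only our own jobs rather than the whole queue.
squeue -u "$USER"

# Cancel the job we just submitted.
scancel "$jobid"
```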
A common training script should look like this:
#!/bin/bash
#SBATCH --gres=gpu:1 ##### how many GPUs to use
#SBATCH --cpus-per-task=6 ##### how many CPUs to use
#SBATCH --mem=32G ##### total memory for the job (use --mem-per-cpu for per-CPU memory)
#SBATCH --time=00:10:00 ##### time limit for training
#SBATCH --account=hpc-cpsc533r
#SBATCH --partition=CPSC533R
source path_to_your_conda_environment/bin/activate
python test.py
We use source to activate the conda environment, and then simply run the Python script.
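By default, Slurm writes the job's stdout and stderr to a file named slurm-<jobid>.out in the submission directory. The filenames can be controlled with the standard --output/--error directives, where %j expands to the job id (the train_%j names below are arbitrary examples):

```
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=6
#SBATCH --mem=32G
#SBATCH --time=00:10:00
#SBATCH --account=hpc-cpsc533r
#SBATCH --partition=CPSC533R
#SBATCH --output=train_%j.out   ##### stdout goes here; %j = job id
#SBATCH --error=train_%j.err    ##### stderr goes here
source path_to_your_conda_environment/bin/activate
python test.py
```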
In the following example, we want to submit 3 jobs at the same time, with 3 different learning rates.
#!/bin/bash
#SBATCH --gres=gpu:1 ##### how many GPUs to use
#SBATCH --cpus-per-task=6 ##### how many CPUs to use
#SBATCH --mem=32G ##### total memory for the job (use --mem-per-cpu for per-CPU memory)
#SBATCH --time=00:10:00 ##### time limit for training
#SBATCH --array=0-2 ##### Job index
#SBATCH --account=hpc-cpsc533r
#SBATCH --partition=CPSC533R
source path_to_your_conda_environment/bin/activate
learning_rates=(0.1 0.01 0.001)
lr=${learning_rates[$SLURM_ARRAY_TASK_ID]}
python test.py --learning_rate $lr
We use #SBATCH --array=0-2
to indicate that we need 3 jobs with indices 0, 1, and 2. Each job can read its own index from the environment variable SLURM_ARRAY_TASK_ID. In this script, we pass learning rates 0.1, 0.01, and 0.001 to jobs 0, 1, and 2, respectively.
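The same indexing trick extends to a grid over several hyperparameters. The sketch below is only an illustration: it simulates SLURM_ARRAY_TASK_ID with a local loop so the index arithmetic can be checked without submitting anything; in a real script you would set #SBATCH --array=0-5 and drop the for loop, letting Slurm run each task once:

```shell
learning_rates=(0.1 0.01 0.001)
batch_sizes=(32 64)

# Simulate the 6 array tasks locally; on the cluster, Slurm sets
# SLURM_ARRAY_TASK_ID and each task runs the loop body exactly once.
for SLURM_ARRAY_TASK_ID in 0 1 2 3 4 5; do
  i=$SLURM_ARRAY_TASK_ID
  lr=${learning_rates[$((i % 3))]}   # cycles through the 3 learning rates
  bs=${batch_sizes[$((i / 3))]}      # switches batch size every 3 tasks
  echo "task $i: lr=$lr batch_size=$bs"
done
```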