
HOWTOs

Note on Remote Access

If you are using PolyU Wi-Fi or a remote network, please make sure your device is connected to the Staff VPN (staff) or the Research VPN (students) before connecting to the login node.

Information about the Staff VPN can be found here, and information about the Research VPN can be found here.

Please also check this page for the system requirements for connecting to the Staff VPN or Research VPN. Usually, the latest security patches for your operating system and a supported anti-virus program with up-to-date definitions are required.

Note

This page applies to the Nvidia platform only. For the Huawei platform, please visit the relevant course in Blackboard (under course "COMP_APULIS_AI_20240 Apulis AI Studio").


Connecting to the Login Node

  1. Use your favorite SSH client program to connect to the host hpclogin.comp.polyu.edu.hk (an example ssh command is shown below this list).
  2. Enter your COMP ID and password. If you have not activated your account or have forgotten your password, please activate it or reset the password at https://acct.comp.polyu.edu.hk.
  3. You are now logged in to the login node.
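
For example, using the OpenSSH command-line client (replace mycompid with your own COMP ID; the host name is the one given above):

ssh mycompid@hpclogin.comp.polyu.edu.hk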



Using SLURM (via SSH)

Load SLURM module

The SLURM module must be loaded before using SLURM.

To check whether SLURM is loaded in the current session, execute the command below:

module list

If SLURM is loaded, you will see an entry like slurm/slurm/23.02.7. The possible output is:

Currently Loaded Modulefiles:
 1) slurm/slurm/23.02.7

If the SLURM module is not loaded in the current session, execute the command below to load it:

module load slurm



Submit job to compute node using srun

Users can submit a job to a compute node using the srun command. The srun command blocks your terminal and executes the desired command on a compute node when resources are available. If resources are not available, SLURM places the job in the queue and it stays in the PENDING state. You can press Ctrl + C to cancel the job.

The syntax of the command is:

srun <options> <command>

For srun options, please refer to the official documentation.

The command below is an example of executing a Python script named myscript.py in the user's home directory. (Note: this command does not expose any GPU; if your program needs a GPU, please refer to the section below on making GPUs visible.)

srun python ~/myscript.py

Description of the command:

python ~/myscript.py: This is the command to be executed on the compute node.

The content of myscript.py is as below:

import socket
import os
print(str(os.getpid())+": "+socket.gethostname())

The possible output after executing the command is:

2856898: hpcnode1
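
If your job needs further constraints, additional srun options can be combined with the command. A hypothetical sketch using standard SLURM flags (the job name, time limit and CPU count are only example values):

srun --job-name=myjob --time=00:30:00 --cpus-per-task=4 python ~/myscript.py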



Submit job to compute node using sbatch

Users can submit a job to a compute node using the sbatch command. sbatch only accepts a batch script (i.e. the first line must start with #!).

First you need to create a batch script. (In this example, we have a Python script named myscript.py and want to run it on a compute node.)

#!/bin/sh

# You can configure your job either by command line arguments or within your script with lines starting with #SBATCH
#SBATCH --gres=gpu:2g.20gb:1
#SBATCH --exclude=hpcnode1
# This is equivalent to supplying argument "--gres=gpu:2g.20gb:1 --exclude=hpcnode1" when running this script using sbatch

# load the module
module load anaconda3
# print the hostname
hostname
# print the path of python
which python
# run my script
python myscript.py
Save the batch script under any name you like. In this example, we save it as batch_script.sh.

Once you have prepared your batch script, you can submit it with sbatch batch_script.sh. sbatch will print out the job ID.

Submitted batch job 263


By default, sbatch saves the output of your script in a file named slurm-<job ID>.out. You can specify the file name with the --output=<filename> argument in your sbatch command. In this example, a file named slurm-263.out can be found in the home directory.
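
For example, the same batch script could be submitted with the configuration passed on the command line and a custom output file name (a hypothetical variation; %j is the standard SLURM placeholder for the job ID):

sbatch --gres=gpu:2g.20gb:1 --exclude=hpcnode1 --output=myjob-%j.out batch_script.sh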

Below is the output after running myscript.py:

hpcnode2
/cm/shared/apps/anaconda3/envs/conda-forge/bin/python
429586: hpcnode2

What is the difference between the srun and sbatch commands?

They are very similar except:

  • srun runs the job in interactive mode and blocks your console even while the job is pending. If your console session is disconnected, your job is terminated immediately. sbatch schedules your job and runs it in the background; the job stays in the queue or keeps running even if your console session is terminated.
  • The output of srun is printed to your console session, while the output of sbatch is stored under your home directory with the name slurm-<job ID>.out by default. You can change the file name with the --output=<filename> parameter.
  • You can configure your job inside the script when using sbatch, but not with srun.
  • Job arrays are only supported by sbatch (see the sketch after this list).
  • The command run by srun can be any executable, while sbatch requires a batch script (i.e. the first line must start with #!).
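
As a sketch of the job-array point above (--array and the SLURM_ARRAY_TASK_ID environment variable are standard SLURM features; the script name and the use of the index as an argument are only illustrative):

#!/bin/sh
#SBATCH --array=0-3
# Each array task receives its own index in the SLURM_ARRAY_TASK_ID variable
python myscript.py $SLURM_ARRAY_TASK_ID

Submitting this script with sbatch schedules four tasks, one per array index.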



Submit job to a specific node

To run a command on a specific node, use the --exclude argument to exclude all undesired nodes.

Below is an example of running the hostname command on hpcnode3 by excluding hpcnode1 and hpcnode2. (Note: the current setup of the HPC environment has three compute nodes named hpcnode1, hpcnode2 and hpcnode3.)

srun --exclude=hpcnode[1,2] hostname

Description of the command:

--exclude=hpcnode[1,2]: This parameter requests SLURM not to schedule the job on hpcnode1 and hpcnode2. If the requested resources are not available, the job will be left in the PENDING state.

hostname: This is the command to be executed on the compute node. It prints the host name of the node running the process to STDOUT.

The possible output after executing the command is:

hpcnode3
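
The same constraint can also be placed inside a batch script for sbatch, using the same syntax on an #SBATCH line:

#SBATCH --exclude=hpcnode[1,2]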



Make GPUs visible in the SLURM job

GPU resources are hidden and unusable on the compute nodes by default. Users must declare what GPU resources are needed to run the command with the --gres argument. Below is an example of requesting three MIG GPUs for the running process.

srun --gres=gpu:2g.20gb:2,gpu:3g.40gb:1 nvidia-smi -L

Description of the command:

--gres=gpu:2g.20gb:2,gpu:3g.40gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with three specific MIG GPUs: two GPUs named 2g.20gb and one GPU named 3g.40gb. The GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.

nvidia-smi -L: This is the command to be executed on the compute node. It lists the GPUs visible to the current job.

The possible output after executing the command is:

GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-9ecf31fa-7d95-2b49-7110-15380e9dbf26)
  MIG 2g.20gb     Device  0: (UUID: MIG-2cb8f2b2-7b8a-5b01-b3e2-db178879b0fe)
  MIG 2g.20gb     Device  1: (UUID: MIG-4f616a71-3200-542e-a63d-137bc2a02820)
GPU 1: NVIDIA A800-SXM4-80GB (UUID: GPU-0c8f9650-7bea-3dcc-fcfa-c8960941f242)
  MIG 3g.40gb     Device  0: (UUID: MIG-a53ba460-f9b0-57c4-bd18-3f4c71007600)
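
The same request can be made from a batch script by putting the option on an #SBATCH line (same syntax as the srun option above):

#SBATCH --gres=gpu:2g.20gb:2,gpu:3g.40gb:1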



Open terminal of compute node or execute commands in interactive mode on SLURM

Sometimes it is more convenient to run multiple commands, such as enroot, or to run Python in interactive mode on a compute node. Pass the argument --pty to enable interactive mode when executing the srun command. If this argument is missing, your terminal will hang when the executed program requests input from STDIN.

Below is an example of opening a terminal on a compute node with one MIG GPU visible to the process.

srun --gres=gpu:2g.20gb:1 --pty bash

Description of the command:

--gres=gpu:2g.20gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with one MIG GPU named 2g.20gb. Replace the GPU name with the one you want; the GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.

--pty: This parameter executes the command in pseudo-terminal mode (interactive mode).

bash: The shell executable. It can also be python3, in which case Python will run in interactive mode.

The terminal of the compute node will be activated on success. The snippet below executes module load anaconda3 and then runs the Python script pytorch-test.py from the working directory (a sketch of this script is given after the snippet):

mycompid@hpcnode1:~$ module load anaconda3
mycompid@hpcnode1:~$ python pytorch-test.py
Torch Version: 2.1.2.post300
Is GPU available: True
Number of GPU: 1
GPU Device Name: NVIDIA A800-SXM4-80GB MIG 2g.20gb
mycompid@hpcnode1:~$
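
The pytorch-test.py script itself is not listed on this page; a minimal sketch that would produce similar output is shown below as a shell heredoc that creates the file (the exact contents of the original script are an assumption):

cat > ~/pytorch-test.py <<'EOF'
# Hypothetical test script: print the PyTorch version and basic GPU information
import torch
print("Torch Version:", torch.__version__)
print("Is GPU available:", torch.cuda.is_available())
print("Number of GPU:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU Device Name:", torch.cuda.get_device_name(0))
EOF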

To terminate the terminal session, please execute the exit command or press Ctrl+D.



Execute commands in a container via SLURM

Below is an example of running a Python script in an NGC TensorFlow image via SLURM.

srun --gres=gpu:2g.20gb:1 --container-image=nvcr.io#nvidia/tensorflow:24.03-tf2-py3 --container-workdir=$HOME python ~/tensorflow-sample.py

Description of the command:

--gres=gpu:2g.20gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with one MIG GPU named 2g.20gb. Replace the GPU name with the one you want; the GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.

--container-image=nvcr.io#nvidia/tensorflow:24.03-tf2-py3: The --container-image argument is provided by Pyxis. This parameter tells Pyxis to call enroot to pull the specified TensorFlow image from the NGC Catalog. The value is in the format accepted by the enroot command. Please refer to the enroot documentation for more information.

--container-workdir=$HOME: The --container-workdir argument is provided by Pyxis. This parameter tells Enroot to set the working directory to the value of the environment variable HOME, which is usually the home directory of the logged-in user.

python ~/tensorflow-sample.py: This is the command to be executed within the TensorFlow image.

The content of tensorflow-sample.py is as below (extracted from the TensorFlow quick start guide):

import tensorflow as tf
print("TensorFlow version:", tf.__version__)

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
predictions

tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)

model.evaluate(x_test,  y_test, verbose=2)

The possible output after executing the command is:

pyxis: importing docker image: nvcr.io#nvidia/tensorflow:24.03-tf2-py3
pyxis: imported docker image: nvcr.io#nvidia/tensorflow:24.03-tf2-py3
2024-04-30 16:45:59.732472: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 16:45:59.732545: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 16:45:59.733648: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 16:45:59.739967: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-30 16:46:01.845354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18028 MB memory:  -> device: 0, name: NVIDIA A800-SXM4-80GB MIG 2g.20gb, pci bus id: 0000:47:00.0, compute capability: 8.0
TensorFlow version: 2.15.0
Epoch 1/5
2024-04-30 16:46:03.493151: I external/local_xla/xla/service/service.cc:168] XLA service 0x154ae8657230 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-30 16:46:03.493198: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA A800-SXM4-80GB MIG 2g.20gb, Compute Capability 8.0
2024-04-30 16:46:03.498174: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-30 16:46:03.534065: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:467] Loaded cuDNN version 90000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1714466763.619721 4174084 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 4s 1ms/step - loss: 0.2928 - accuracy: 0.9138
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1418 - accuracy: 0.9583
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1073 - accuracy: 0.9670
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0887 - accuracy: 0.9729
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0750 - accuracy: 0.9769
313/313 - 1s - loss: 0.0720 - accuracy: 0.9782 - 501ms/epoch - 2ms/step

Please note that the container image layers are stored in the .cache/enroot directory in your home directory. You will need to remove this directory manually to free up disk space.
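
For example, the cached layers can be removed with the command below (this simply deletes the directory mentioned above):

rm -rf ~/.cache/enroot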



How can I check my pending and running jobs?

You can get the list of your pending and running jobs with the command below:

squeue -u <my COMP ID>
Replace <my COMP ID> with your COMP ID. For example, if your COMP ID is mycompid, then the command becomes squeue -u mycompid.



How can I cancel my running or pending job?

You can run the below command to cancel your job:

scancel <SLURM job ID>
Replace <SLURM job ID> with the job ID of your job.



How can I get the status of my jobs?

You can list your jobs within a specific period, together with their status, using the command below:

sacct -S <start date> -E <end date>
For example, to list your jobs from 6 Jun 2024 to 7 Jun 2024, execute: sacct -S 2024-06-06 -E 2024-06-07.
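
You can optionally narrow the columns shown with the --format option; the field names below are standard sacct fields:

sacct -S 2024-06-06 -E 2024-06-07 --format=JobID,JobName,State,Elapsed,ExitCode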



Using Container (via SSH)

Prepare own container image using Enroot

To prepare a container that works with SLURM and Pyxis, the enroot tool, developed by Nvidia, is recommended. It makes building a customized container for later use straightforward. Follow the steps below to create a container that works with SLURM and Pyxis.

  1. Connect to the HPC login node with your COMP ID and password using your favorite SSH client. If you are a COMP student, the COMP ID is usually your student ID. This example uses the ssh command and logs in as mycompid.

    ssh mycompid@hpclogin.comp.polyu.edu.hk
    
    Possible output:
    The authenticity of host 'hpclogin.comp.polyu.edu.hk (xx.xx.xx.xx)' can't be established.
    ED25519 key fingerprint is SHA256:WoKEtbRQ2Ci3YUdgQpuo2R6cferYppeyM6LjbW4Qhu8.
    This key is not known by any other names
    Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
    Warning: Permanently added 'hpclogina.comp.polyu.edu.hk' (ED25519) to the list of known hosts.
    mycompid@hpclogina.comp.polyu.edu.hk's password:
    Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-28-generic x86_64)
    
     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/pro
    
    Expanded Security Maintenance for Applications is not enabled.
    
    0 updates can be applied immediately.
    
    9 additional security updates can be applied with ESM Apps.
    Learn more about enabling ESM Apps service at https://ubuntu.com/esm
    
    
    Welcome to Base Command Manager 10.0
    
                                            Based on Ubuntu Jammy Jellyfish 22.04
                                                        Cluster Manager ID: #00000
    
    Use the following commands to adjust your environment:
    
    'module avail'            - show available modules
    'module add <module>'     - adds a module to your environment for this session
    'module initadd <module>' - configure module to be loaded at every login
                                (Note: initadd is available only for Tcl modules)
    
    -------------------------------------------------------------------------------
    
    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    
    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.
    
    Last login: Fri Apr 26 15:39:59 2024 from xx.xx.xx.xx
    mycompid@hpclogina:~$
    

  2. Open a terminal to one of the HPC compute nodes.

    srun --gres=gpu:2g.20gb:1 --pty bash
    

  3. Make sure you are in your home directory.

    cd
    

  4. Pull the image from the NGC Catalog. (Images from other sources should work, but images from the NGC Catalog are tested by Nvidia.) This example pulls the CUDA image, which is based on Rocky Linux, into the file nvidia-cuda.sqsh.

    enroot import -o nvidia-cuda.sqsh docker://nvcr.io#nvidia/cuda:12.4.1-cudnn-devel-rockylinux8
    
    Possible output on completion:
    [INFO] Querying registry for permission grant
    [INFO] Authenticating with user: <anonymous>
    [INFO] Authentication succeeded
    [INFO] Fetching image manifest list
    [INFO] Fetching image manifest
    [INFO] Downloading 12 missing layers...
    
    100% 12:0=0s a1b3e78ec0cae9530a969863cdcc4ec54767944b01ea2d16cdd68e552565ce1e
    
    [INFO] Extracting image layers...
    
    100% 11:0=0s 7ecefaa6bd84a24f90dbe7872f28a94e88520a07941d553579434034d9dca399
    
    [INFO] Converting whiteouts...
    
    100% 11:0=0s 7ecefaa6bd84a24f90dbe7872f28a94e88520a07941d553579434034d9dca399
    
    [INFO] Creating squashfs filesystem...
    
    Parallel mksquashfs: Using 256 processors
    Creating 4.0 filesystem on /home/mycompid/nvidia-cuda.sqsh, block size 131072.
    [===========================================================-] 73063/73063 100%
    
    Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
            uncompressed data, uncompressed metadata, uncompressed fragments,
            uncompressed xattrs, uncompressed ids
            duplicates are not removed
    Filesystem size 7870665.55 Kbytes (7686.20 Mbytes)
            99.99% of uncompressed filesystem size (7871305.51 Kbytes)
    Inode table size 830843 bytes (811.37 Kbytes)
            100.00% of uncompressed inode table size (830843 bytes)
    Directory table size 453291 bytes (442.67 Kbytes)
            100.00% of uncompressed directory table size (453291 bytes)
    No duplicate files removed
    Number of inodes 16313
    Number of files 12971
    Number of fragments 1218
    Number of symbolic links 1524
    Number of device nodes 0
    Number of fifo nodes 0
    Number of socket nodes 0
    Number of directories 1818
    Number of ids (unique uids + gids) 1
    Number of uids 1
            root (0)
    Number of gids 1
            root (0)
    

  5. Create the container from the pulled image.

    enroot create --name mycontainer nvidia-cuda.sqsh
    
    Possible output on completion:
    [INFO] Extracting squashfs filesystem...
    
    Parallel unsquashfs: Using 256 processors
    16187 inodes (74587 blocks) to write
    
    [===========================================================-] 74587/74587 100%
    
    created 12971 files
    created 1818 directories
    created 1524 symlinks
    created 0 devices
    created 0 fifos
    created 0 sockets
    

  6. Start the container.

    enroot start --root --rw mycontainer
    
    Parameters:

    --root: Work as root in the container.

    --rw: Make the container root filesystem writable.

    Possible output on completion:
    ==========
    == CUDA ==
    ==========
    
    CUDA Version 12.4.1
    
    Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    
    This container image and its contents are governed by the NVIDIA Deep Learning Container License.
    By pulling and using the container, you accept the terms and conditions of this license:
    https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
    
    A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
    
    bash-4.4#
    

  7. (Optional) Ensure you are working in the container. Since the base OS of the container is different from the host, in this example you can simply print the OS name to verify.

    grep PRETTY /etc/os-release
    
    Possible output:
    PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
    

  8. (Optional) Check the GPUs are visible in the container.

    nvidia-smi -L
    
    Possible output:
    GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-d8c65f42-1185-3353-8b74-b92fa2e4d9cd)
    MIG 2g.20gb     Device  0: (UUID: MIG-e7d1f657-9379-5970-b126-7347eee7b3a9)
    

  9. At this stage, you can install any packages you want and make other changes to the container. In our example, we install Python 3.11 and pip using the dnf (or yum) command.

    dnf install -y python3.11 python3.11-pip
    
    Possible output:
    Rocky Linux 8 - AppStream                       8.1 MB/s |  12 MB     00:01
    Rocky Linux 8 - BaseOS                           12 MB/s | 8.2 MB     00:00
    Rocky Linux 8 - Extras                           23 kB/s |  14 kB     00:00
    cuda                                             17 MB/s | 3.3 MB     00:00
    Dependencies resolved.
    ================================================================================
     Package                       Arch     Version               Repository   Size
    ================================================================================
    Installing:
     python3.11                    x86_64   3.11.5-1.el8_9        appstream    29 k
     python3.11-pip                noarch   22.3.1-4.el8_9.1      appstream   2.9 M
    Installing dependencies:
     mpdecimal                     x86_64   2.5.1-3.el8           appstream    92 k
     python3.11-libs               x86_64   3.11.5-1.el8_9        appstream    10 M
     python3.11-pip-wheel          noarch   22.3.1-4.el8_9.1      appstream   1.4 M
     python3.11-setuptools-wheel   noarch   65.5.1-2.el8          appstream   719 k
    Installing weak dependencies:
     python3.11-setuptools         noarch   65.5.1-2.el8          appstream   2.0 M
    
    Transaction Summary
    ================================================================================
    Install  7 Packages
    
    Total download size: 18 M
    Installed size: 67 M
    Downloading Packages:
    (1/7): python3.11-3.11.5-1.el8_9.x86_64.rpm     701 kB/s |  29 kB     00:00
    (2/7): mpdecimal-2.5.1-3.el8.x86_64.rpm         1.7 MB/s |  92 kB     00:00
    (3/7): python3.11-pip-wheel-22.3.1-4.el8_9.1.no  38 MB/s | 1.4 MB     00:00
    (4/7): python3.11-setuptools-65.5.1-2.el8.noarc  13 MB/s | 2.0 MB     00:00
    (5/7): python3.11-setuptools-wheel-65.5.1-2.el8  57 MB/s | 719 kB     00:00
    (6/7): python3.11-pip-22.3.1-4.el8_9.1.noarch.r  13 MB/s | 2.9 MB     00:00
    (7/7): python3.11-libs-3.11.5-1.el8_9.x86_64.rp  12 MB/s |  10 MB     00:00
    --------------------------------------------------------------------------------
    Total                                            12 MB/s |  18 MB     00:01
    Running transaction check
    Transaction check succeeded.
    Running transaction test
    Transaction test succeeded.
    Running transaction
      Preparing        :                                                        1/1
      Installing       : python3.11-setuptools-wheel-65.5.1-2.el8.noarch        1/7
      Installing       : python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch           2/7
      Installing       : mpdecimal-2.5.1-3.el8.x86_64                           3/7
      Installing       : python3.11-3.11.5-1.el8_9.x86_64                       4/7
      Running scriptlet: python3.11-3.11.5-1.el8_9.x86_64                       4/7
      Installing       : python3.11-libs-3.11.5-1.el8_9.x86_64                  5/7
      Installing       : python3.11-setuptools-65.5.1-2.el8.noarch              6/7
      Installing       : python3.11-pip-22.3.1-4.el8_9.1.noarch                 7/7
      Running scriptlet: python3.11-pip-22.3.1-4.el8_9.1.noarch                 7/7
      Verifying        : mpdecimal-2.5.1-3.el8.x86_64                           1/7
      Verifying        : python3.11-3.11.5-1.el8_9.x86_64                       2/7
      Verifying        : python3.11-libs-3.11.5-1.el8_9.x86_64                  3/7
      Verifying        : python3.11-pip-22.3.1-4.el8_9.1.noarch                 4/7
      Verifying        : python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch           5/7
      Verifying        : python3.11-setuptools-65.5.1-2.el8.noarch              6/7
      Verifying        : python3.11-setuptools-wheel-65.5.1-2.el8.noarch        7/7
    
    Installed:
      mpdecimal-2.5.1-3.el8.x86_64
      python3.11-3.11.5-1.el8_9.x86_64
      python3.11-libs-3.11.5-1.el8_9.x86_64
      python3.11-pip-22.3.1-4.el8_9.1.noarch
      python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch
      python3.11-setuptools-65.5.1-2.el8.noarch
      python3.11-setuptools-wheel-65.5.1-2.el8.noarch
    
    Complete!
    

  10. Now verify that Python and pip are installed in the container.

    python3 --version && pip3 --version
    
    Possible Output:
    Python 3.11.5
    pip 22.3.1 from /usr/lib/python3.11/site-packages/pip (python 3.11)
    
    You may further install Anaconda 3 / Miniconda or other Python packages. In our example, we will install PyTorch in the container.

  11. Update pip to the latest version.

    python3 -m pip install --upgrade pip
    
    Possible output:
    Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
    Requirement already satisfied: pip in /usr/lib/python3.11/site-packages (22.3.1)
        Collecting pip
      Downloading pip-24.0-py3-none-any.whl (2.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 127.7 MB/s eta 0:00:00
    Installing collected packages: pip
    Successfully installed pip-24.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    

  12. Install PyTorch using pip.
    The current version of PyTorch (2.3.0) requires cuDNN version 8 and the CUDA toolkit to be installed.

    dnf install -y libcudnn8 cuda-toolkit-12-4
    

    Install PyTorch.
    pip3 install torch torchvision torchaudio
    

    Check installed PyTorch version.
    python3 -c "import torch;print(torch.__version__)"
    
    Possible output:
    2.3.0+cu121
    

    Verify PyTorch installation.
    python3 -c "import torch;x = torch.rand(5,3);print(x)"
    
    Possible output:
    tensor([[7.6915e-01, 5.1951e-01, 4.5926e-01],
        [1.1503e-01, 5.6972e-01, 2.0289e-01],
        [5.7117e-01, 6.2400e-04, 5.4877e-01],
        [4.0498e-02, 2.3259e-01, 7.0158e-01],
        [3.1888e-01, 9.1385e-01, 1.0809e-01]])
    

  13. Exit the container if finished.

    exit
    

  14. Be sure to export the container before leaving the SLURM session. Otherwise, your changes will be lost once the SLURM session exits.

    enroot export -o mycontainer.sqsh mycontainer
    
    Possible output on completion:
    [INFO] Creating squashfs filesystem...
    
    Parallel mksquashfs: Using 256 processors
    Creating 4.0 filesystem on /home/mycompid/mycontainer.sqsh, block size 131072.
    [=========================================================|] 160588/160588 100%
    
    Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
            uncompressed data, uncompressed metadata, uncompressed fragments,
            uncompressed xattrs, uncompressed ids
            duplicates are not removed
    Filesystem size 15547518.67 Kbytes (15183.12 Mbytes)
            100.00% of uncompressed filesystem size (15548159.75 Kbytes)
    Inode table size 2220814 bytes (2168.76 Kbytes)
            100.00% of uncompressed inode table size (2220814 bytes)
    Directory table size 1470358 bytes (1435.90 Kbytes)
            100.00% of uncompressed directory table size (1470358 bytes)
    Xattr table size 2007 bytes (1.96 Kbytes)
            102.40% of uncompressed xattr table size (1960 bytes)
    No duplicate files removed
    Number of inodes 50683
    Number of files 43503
    Number of fragments 3882
    Number of symbolic links 2250
    Number of device nodes 0
    Number of fifo nodes 0
    Number of socket nodes 0
    Number of directories 4930
    Number of ids (unique uids + gids) 1
    Number of uids 1
            root (0)
    Number of gids 1
            root (0)
    

    Verify the container is exported.
    ls -l mycontainer.sqsh
    
    Possible output:
    -rw-r--r-- 1 mycompid mygroup 15920660480 Apr 30 15:42 mycontainer.sqsh
    

  15. Exit SLURM session.

    exit
    



Execute commands in the container image prepared with Enroot

Below is an example of running a Python script in the previously created container image via SLURM. SLURM must be used to expose GPU resources to the process.

srun --gres=gpu:2g.20gb:1 --container-image=./mycontainer.sqsh --container-workdir=$HOME python3 ~/pytorch-sample.py

Description of the command:

--gres=gpu:2g.20gb:1: This parameter requests SLURM to schedule the job with one MIG GPU named 2g.20gb. If the requested resources are not available, the job will be left in the PENDING state.

--container-image=./mycontainer.sqsh: Tells enroot to load the image from the local mycontainer.sqsh file. Please note that the ./ prefix is necessary when referring to a file in the current working directory, because enroot only looks for the image locally when it sees the slash character (/).

--container-workdir=$HOME: The --container-workdir argument is provided by Pyxis. This parameter tells Enroot to set the working directory to the value of the environment variable HOME, which is usually the home directory of the logged-in user.

python3 ~/pytorch-sample.py: This is the command to be executed within the image.


The content of pytorch-sample.py is as below (extracted from the PyTorch quick start guide):

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)


print("Training Model")
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")


classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

model.eval()
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
    x = x.to(device)
    pred = model(x)
    predicted, actual = classes[pred[0].argmax(0)], classes[y]
    print(f'Predicted: "{predicted}", Actual: "{actual}"')



The possible output after executing the command is:

Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
Using cuda device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
Training Model
Epoch 1
-------------------------------
loss: 2.311987  [   64/60000]
loss: 2.296749  [ 6464/60000]
loss: 2.278760  [12864/60000]
loss: 2.267926  [19264/60000]
loss: 2.258412  [25664/60000]
loss: 2.225776  [32064/60000]
loss: 2.235035  [38464/60000]
loss: 2.198730  [44864/60000]
loss: 2.192165  [51264/60000]
loss: 2.163504  [57664/60000]
Test Error:
 Accuracy: 45.5%, Avg loss: 2.160435

Epoch 2
-------------------------------
loss: 2.174705  [   64/60000]
loss: 2.159503  [ 6464/60000]
loss: 2.106915  [12864/60000]
loss: 2.117778  [19264/60000]
loss: 2.075339  [25664/60000]
loss: 2.014432  [32064/60000]
loss: 2.040038  [38464/60000]
loss: 1.959843  [44864/60000]
loss: 1.951941  [51264/60000]
loss: 1.890565  [57664/60000]
Test Error:
 Accuracy: 59.4%, Avg loss: 1.887667

Epoch 3
-------------------------------
loss: 1.928848  [   64/60000]
loss: 1.885536  [ 6464/60000]
loss: 1.776609  [12864/60000]
loss: 1.809799  [19264/60000]
loss: 1.709673  [25664/60000]
loss: 1.662814  [32064/60000]
loss: 1.678832  [38464/60000]
loss: 1.583767  [44864/60000]
loss: 1.592498  [51264/60000]
loss: 1.495840  [57664/60000]
Test Error:
 Accuracy: 61.4%, Avg loss: 1.516900

Epoch 4
-------------------------------
loss: 1.592645  [   64/60000]
loss: 1.541337  [ 6464/60000]
loss: 1.404812  [12864/60000]
loss: 1.473237  [19264/60000]
loss: 1.354290  [25664/60000]
loss: 1.348429  [32064/60000]
loss: 1.361446  [38464/60000]
loss: 1.291682  [44864/60000]
loss: 1.316547  [51264/60000]
loss: 1.219320  [57664/60000]
Test Error:
 Accuracy: 63.4%, Avg loss: 1.250891

Epoch 5
-------------------------------
loss: 1.336257  [   64/60000]
loss: 1.299178  [ 6464/60000]
loss: 1.149520  [12864/60000]
loss: 1.252716  [19264/60000]
loss: 1.123262  [25664/60000]
loss: 1.144166  [32064/60000]
loss: 1.165725  [38464/60000]
loss: 1.110545  [44864/60000]
loss: 1.144198  [51264/60000]
loss: 1.057242  [57664/60000]
Test Error:
 Accuracy: 64.7%, Avg loss: 1.084700

Done!
Predicted: "Ankle boot", Actual: "Ankle boot"
Predicted: "Ankle boot", Actual: "Ankle boot"



Work with pre-pulled Apptainer (Singularity) images

Some images are pre-pulled, converted to Apptainer images and stored under /container-image with the file extension .sif.

You can list the Apptainer images from the login node using the command below:

ls -l /container-image/*.sif

You can use the apptainer command to work with the Apptainer image.

  1. First open a terminal to a compute node. (More info)

    srun --gres=gpu:2g.20gb:1 --pty bash
    

  2. Load Apptainer module:

    module load apptainer
    

  3. Open a terminal in the Apptainer image. In this example, we are working with /container-image/tensorflow:23.11-tf2-py3.sif.

    apptainer run --nv /container-image/tensorflow:23.11-tf2-py3.sif
    
    The --nv parameter enables NVIDIA GPUs in Apptainer.

    Once the image is loaded, you will reach a prompt like:
    Apptainer>
    

  4. You are now in the terminal of the image. Your home directory is mapped automatically. You can execute your program now. For example:

    python ~/tensorflow-sample.py
    

  5. To exit the terminal of the image, either execute exit or press Ctrl+D.
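
As an alternative to the interactive terminal in steps 3 to 5, a single command can be run in the image non-interactively; a sketch using the standard apptainer exec subcommand (the script path is only an example):

apptainer exec --nv /container-image/tensorflow:23.11-tf2-py3.sif python ~/tensorflow-sample.py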



Using Anaconda 3 (via SSH)

Work with pre-installed Anaconda 3

Anaconda 3 is pre-installed on the server. TensorFlow, PyTorch and SciPy (from the conda-forge channel) are installed in the conda-forge environment.

To work with Anaconda 3:

  1. Be sure you are in the terminal of a compute node (More info)

    srun --gres=gpu:2g.20gb:1 --pty bash
    

  2. Load Anaconda 3 module:

    module load anaconda3
    
    The conda-forge environment is activated by default. You can check the current conda environment with the command conda info.

  3. You can check which packages and which versions are installed with this command:

    conda list
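
    For example, to check only the packages mentioned above (the grep filter is just a convenience):

    conda list | grep -iE "tensorflow|pytorch|scipy"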
    



Create own conda environment and install own packages to home directory

The default installed packages may not meet your needs (e.g. you may want to use another Python version or install additional packages). You can create your own conda environment to install a different Python version and any other packages.

  1. Check that the conda command is available in your session:

    conda --version
    

  2. If the conda command is not found in your PATH, be sure to load Anaconda 3 in your session:

    module load anaconda3
    

  3. Create your own environment. We name it myenv in this example.

    conda create -n myenv -c conda-forge python=3.10
    

    This creates a conda environment named myenv with Python 3.10 installed from the conda-forge channel. Installing packages from the conda-forge channel is recommended because it tends to have more up-to-date packages. The environment will be located at /home/mycompid/.conda/envs/myenv.

    The parameters:

    -n myenv: The name of the target environment.

    -c conda-forge: Install packages from specific channel.

    python=3.10: Packages to be installed into the new environment.

  4. Activate your environment:
    If you have not executed conda init before, you must execute it first. This command modifies your login script(s) to load the Anaconda 3 base environment every time you log in to the HPC servers. To reverse this change, you can execute conda init --reverse.

    Each time you log in to the login node, source your login script to set up your Anaconda 3 environment:

    source ~/.bashrc
    

    The command below activates your own environment (environment name: myenv).
    conda activate myenv
    
    You can check the active environment by conda info command.

  5. To install packages using conda from the conda-forge channel, execute a command like conda install -c conda-forge <package name>. For example, the command to install PyTorch and TensorFlow with CUDA version 12.2 into your own environment is:

    conda install -c conda-forge pytorch=*=cuda120* tensorflow=*=cuda120* torchvision=*=cuda120* cuda-version=12.2
    
    Once finished, you can verify the installation with the conda list pytorch command. You can uninstall the package with conda uninstall pytorch.

  6. To install packages using pip, execute a command like pip install <package name>. For example, the command to install openai 1.23 into your own environment is:

    pip install openai==1.23
    
    If you have an active conda environment, pip installs packages into it, so you can also verify the installation with the conda list openai command. You can uninstall the package with pip uninstall openai. This only uninstalls the specific package; dependencies are not uninstalled automatically and you will have to remove them manually if needed.

  7. (Optional) If you want to use the environment in JupyterHub, please install cm-jupyter-eg-kernel-wlm using pip in your environment:

    pip install cm-jupyter-eg-kernel-wlm
    

  8. To deactivate your own environment, you can execute command:

    conda deactivate
    
    You will go back to the previous environment. You can check your current environment with the conda info command and re-activate the environment by executing conda activate myenv again.

Additional note: if you want to remove your own environment, execute the commands below (for example, to remove the environment named myenv):

conda deactivate
conda env remove --name myenv
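
You can list all of your conda environments at any time to confirm which ones exist (a standard conda command):

conda env list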



Using JupyterHub

What is a kernel and what is a kernel template?

A kernel is used to run your code on the server in a Jupyter Notebook. A kernel template helps you create a kernel to be used in a Jupyter Notebook.

Access and login JupyterHub

  1. Access JupyterHub by this link https://hpclogin.comp.polyu.edu.hk:8000/.
  2. Enter your COMP ID and password into the appropriate text boxes. Click the "Sign in" button to log in.
    JupyterHub

For details about JupyterHub interface, please refer to JupyterHub official documentation.



Create a Jupyter Notebook that work with SLURM

  1. Create a kernel from kernel template
    Note: If you have already created a kernel that works with SLURM, you do not need to create another one as long as the resource requirement is the same. Create kernel from kernel template

    1. Click the icon in the left sidebar.
    2. Click the icon next to the kernel template. Here we will use "Python 3.9 via SLURM" as an example.
      Create kernel from kernel template
    3. Select the GPU resource to be used by this kernel. Please note that the resource cannot be changed after kernel creation; if you want to use another resource, you will need to create another kernel. The MIG profile has the limitation that only one GPU resource can be used in a process.
    4. Change the display name to one that lets you identify the kernel.
      Create kernel from kernel template
    5. You can see the newly created kernel as shown in the image.
  2. Once you have created a kernel, new icons linked to that kernel are shown in the Launcher. Go to the Launcher and click the kernel icon under "Notebook".
    Create Notebook

  3. The Jupyter Notebook is created. Please verify the status of the kernel at the top right corner. If it is not the right kernel, or the kernel status icon is not a hollow circle, please select the proper kernel and try again; otherwise, your code cannot run on a compute node.
    Create Notebook
    If everything is fine, you can start coding in the Jupyter notebook.

  4. You can, at any time, change the notebook kernel by clicking the active kernel name at the top-right corner of your notebook.

Note: If the message "Error from Gateway: [Timeout during request] Exception while attempting to connect to Gateway server url 'https://localhost:8890'. Ensure gateway url is valid and the Gateway instance is running." appears when you select the kernel, it is usually because the resources you requested cannot be allocated. When you switch to a kernel that runs via SLURM in a Jupyter Notebook, a new SLURM job is scheduled. The SLURM job is not cancelled automatically even if the kernel times out; in this case, you will need to cancel it manually.
JupyterHub
To cancel the kernel job:
1. Open a terminal.
2. Execute module load slurm
3. Execute squeue -u <my COMP ID> (Replace <my COMP ID> with your COMP ID) to get a list of active jobs. Locate the JOBID of the pending job ("PD" under "ST" column and "(QOSMaxGRESPerJob)" under "NODELIST(REASON)" column).
JupyterHub
4. Execute scancel <JOB ID> (Replace <JOB ID> with the value you get in point 3 above).
5. The job should be cancelled. You can verify by executing squeue -u <my COMP ID> again.



How to use Anaconda3 in JupyterHub?

  1. (First time only) Connect to the HPC server via SSH (Reference) or open a terminal in JupyterHub (Reference).
  2. (First time only) Load the Anaconda 3 module with the command module load anaconda3, or set up your own Anaconda 3 environment.
  3. (First time only) Execute conda init
  4. Login JupyterHub (Reference)
  5. Create a kernel from kernel template named "Python Conda kernel on SLURM" with proper resource. (Reference)
  6. In the "Environment to use" field, select the anaconda environment to be used in the kernel.
    Select Anaconda environment
  7. Create a new Jupyter notebook using the kernel created from "Python Conda kernel on SLURM" (Reference) or switch the kernel in existing Jupyter notebook.
    Change Kernel
  8. Please verify the status of the kernel at the top right corner. If it is not the right kernel, or the kernel status icon is not a hollow circle, please select the proper kernel and try again; otherwise, your code cannot run on a compute node.
    If everything is fine, you can start coding in the Jupyter notebook.

Note

The module cm-jupyter-eg-kernel-wlm must be installed in your environment. If it is not installed, please install it using pip in your environment:

pip install cm-jupyter-eg-kernel-wlm



Open a terminal in JupyterHub

  1. Create a new "Launcher".
  2. Click "Terminal" icon.
    Open a terminal in JupyterHub
  3. A terminal is opened in which you can execute commands, e.g. for managing conda environments.
    A shell is opened

What kernels and kernel templates are available?

The following kernels and kernel templates are available on the server.

Name | Kernel or Kernel Template | Description
Python 3 | Kernel | Your code will run on the login node. This is not recommended.
Python 3.9 via SLURM | Kernel Template | Your code will run on a compute node if the required resources are available. The system Python version 3.9 will be used.
Python Conda kernel on SLURM | Kernel Template | Your code will run on a compute node if the required resources are available. The Anaconda environment you selected during kernel creation will be used.

Note: If this item is not visible in your interface, please execute conda init in a terminal and reload JupyterHub.