HOWTOs¶
Note on Remote Access
If you are using PolyU Wi-Fi or a remote network, please make sure your device is connected to the Staff VPN (staff) or the Research VPN (students) before connecting to the login node.
Information about the Staff VPN can be found here, and about the Research VPN here.
Please also check this page for the system requirements when connecting to the Staff VPN or Research VPN. Usually, the latest security patches for your system and a supported anti-virus software with the latest definition updates are required.
Note
This page applies to the Nvidia platform only. For the Huawei platform, please visit the relevant course in Blackboard (under the course "COMP_APULIS_AI_20240 Apulis AI Studio").
Table of contents
- Connecting to the Login Node
- Using SLURM (via SSH)
- Using Container (via SSH)
- Using Anaconda 3 (via SSH)
- Using JupyterHub
Connecting to the Login Node¶
- Use your favorite SSH client program to connect to the host hpclogin.comp.polyu.edu.hk.
- Enter your COMP ID and password. If you have not activated your account or have forgotten your password, please activate it or reset your password at https://acct.comp.polyu.edu.hk.
- You are now logged in to the login node.
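For example, with the OpenSSH command-line client (replace mycompid with your own COMP ID):

```
ssh mycompid@hpclogin.comp.polyu.edu.hk
```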
Using SLURM (via SSH)¶
Load SLURM module¶
The SLURM module must be loaded before using SLURM.
To check whether SLURM is loaded in the current session, you can execute the command below:
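The listing command itself was not preserved in this copy; with the Environment Modules system used on this cluster it is:

```
module list
```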
If SLURM is loaded, you will see an entry like slurm/slurm/23.02.7 in the list of currently loaded modules.
If the SLURM module is not loaded in the current session, you can execute the command below to load it:
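As also used later on this page in the JupyterHub section:

```
module load slurm
```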
Submit job to a compute node using srun¶
Users can submit a job to a compute node using the srun command. The srun command blocks your terminal and executes your desired command on a compute node when resources are available. If resources are not available, the job will be in the PENDING state and SLURM will put it in the job queue. You can press Ctrl + C on your keyboard to cancel the job.
The syntax of the command is:
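The general form (see the official srun documentation for the full option list):

```
srun [options] <command> [arguments...]
```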
For srun options, please refer to the official documentation.
The command below is an example that executes a Python script named myscript.py in the user's home directory. (Note: this command does not expose any GPU; if your program needs a GPU, please refer to the section below.)
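Based on the command description that follows, the invocation is most likely:

```
srun python ~/myscript.py
```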
Description of the command:
python ~/myscript.py: This is the command to be executed on the compute node.
The content of myscript.py is as below:
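The original listing was not preserved in this copy; a hypothetical stand-in consistent with the surrounding discussion:

```
# myscript.py -- hypothetical example; the original listing was not preserved
import socket

# Print which node the script actually ran on
print(f"Hello from {socket.gethostname()}")
```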
The possible output after executing the command is:
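Assuming the hypothetical script above and the job landing on hpcnode1, something like:

```
Hello from hpcnode1
```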
Submit job to a compute node using sbatch¶
Users can submit a job to a compute node using the sbatch command. sbatch only accepts a batch script (i.e. the first line must start with #!).
First you need to create a batch script. (In this example, we have a Python script named myscript.py that we want to run on a compute node.)
#!/bin/sh
# You can configure your job either by command line arguments or within your script with lines starting with #SBATCH
#SBATCH --gres=gpu:2g.20gb:1
#SBATCH --exclude=hpcnode1
# This is equivalent to supplying argument "--gres=gpu:2g.20gb:1 --exclude=hpcnode1" when running this script using sbatch
# load the module
module load anaconda3
# print the hostname
hostname
# print the path of python
which python
# run my script
python myscript.py
Save the script as batch_script.sh.
Once you have prepared your batch script, you can submit it by running sbatch batch_script.sh. sbatch will print out the job ID.
By default, sbatch saves the output of your script in a file named slurm-<job ID>.out. You can specify the file name with the --output=<filename> argument in your sbatch command. In this example, a file named slurm-263.out can be found in the home directory.
Below is the output after running myscript.py:
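The original file contents were not preserved. Given the batch script above (hostname, which python, then the script) and the hypothetical myscript.py, slurm-263.out would look roughly like:

```
hpcnode1
<path of the anaconda3 python>
Hello from hpcnode1
```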
What is the difference between the srun and sbatch commands?
They are very similar except:
- srun runs the job in interactive mode and your console is blocked even while your job is pending. If your console session is disconnected, your job is terminated immediately. sbatch, by contrast, schedules and runs your job in the background; your job stays in the queue or keeps running even if your console session is terminated.
- The output when using srun is printed in your console session, while the output when using sbatch is stored under your home directory with the name slurm-<job ID>.out by default. You can change the file name with the --output=<filename> parameter.
- You can configure your job inside the script when executing sbatch, but not with srun.
- Job arrays are only supported by sbatch.
- The command run by srun can be any executable, while sbatch requires a batch script (i.e. the first line must start with #!).
Submit job to a specific node¶
Users need to use the --exclude argument to exclude all undesired nodes in order to run a command on a specific node.
Below is an example that runs the hostname command on hpcnode3 by excluding hpcnode1 and hpcnode2. (Note: the current HPC environment has three compute nodes, named hpcnode1, hpcnode2 and hpcnode3.)
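Based on the parameter descriptions that follow, the invocation is most likely:

```
srun --exclude=hpcnode[1,2] hostname
```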
Description of the command:
--exclude=hpcnode[1,2]: This parameter requests SLURM not to schedule the job on hpcnode1 and hpcnode2. If the requested resources are not available, the job will be left in the PENDING state.
hostname: This is the command to be executed on the compute node. This command prints the host name of the running process to STDOUT.
The possible output after executing the command is:
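Since hpcnode1 and hpcnode2 are excluded, the output should simply be:

```
hpcnode3
```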
Make GPUs visible in the SLURM job¶
By default, the GPU resources are hidden and unusable from the compute nodes. Users are required to declare the GPU resources their command needs via the --gres argument. Below is an example that requests three MIG GPUs for the running process.
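Based on the parameter descriptions that follow, the invocation is most likely:

```
srun --gres=gpu:2g.20gb:2,gpu:3g.40gb:1 nvidia-smi -L
```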
Description of the command:
--gres=gpu:2g.20gb:2,gpu:3g.40gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with three specific MIG GPUs: two GPUs named 2g.20gb and one GPU named 3g.40gb. The GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.
nvidia-smi -L: This is the command to be executed on the compute node. It lists the GPUs visible in the current job.
The possible output after executing the command is:
GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-9ecf31fa-7d95-2b49-7110-15380e9dbf26)
MIG 2g.20gb Device 0: (UUID: MIG-2cb8f2b2-7b8a-5b01-b3e2-db178879b0fe)
MIG 2g.20gb Device 1: (UUID: MIG-4f616a71-3200-542e-a63d-137bc2a02820)
GPU 1: NVIDIA A800-SXM4-80GB (UUID: GPU-0c8f9650-7bea-3dcc-fcfa-c8960941f242)
MIG 3g.40gb Device 0: (UUID: MIG-a53ba460-f9b0-57c4-bd18-3f4c71007600)
Open a terminal on a compute node or execute commands in interactive mode on SLURM¶
Sometimes it is more convenient to run multiple commands, like enroot, or to run Python in interactive mode on a compute node. Pass the --pty argument to enable interactive mode when executing the srun command. If this argument is missing, your terminal will hang whenever the executing program requests input from STDIN.
Below is an example that opens a terminal on a compute node with one MIG GPU visible to the process.
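Based on the parameter descriptions that follow, the invocation is most likely:

```
srun --gres=gpu:2g.20gb:1 --pty bash
```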
Description of the command:
--gres=gpu:2g.20gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with one MIG GPU named 2g.20gb. Replace the GPU name with the one you want; the GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.
--pty: Execute the command in pseudo-terminal mode (interactive mode).
bash: The shell executable. It can also be python3, in which case Python runs in interactive mode.
On success, a terminal on the compute node is opened. The snippet below executes module load anaconda3 and then runs the Python script pytorch-test.py from the working directory:
mycompid@hpcnode1:~$ module load anaconda3
mycompid@hpcnode1:~$ python pytorch-test.py
Torch Version: 2.1.2.post300
Is GPU available: True
Number of GPU: 1
GPU Device Name: NVIDIA A800-SXM4-80GB MIG 2g.20gb
mycompid@hpcnode1:~$
To terminate the terminal session, execute the exit command or press Ctrl+D.
Execute commands in a container via SLURM¶
Below is an example that runs a Python script in an NGC TensorFlow image via SLURM.
srun --gres=gpu:2g.20gb:1 --container-image=nvcr.io#nvidia/tensorflow:24.03-tf2-py3 --container-workdir=$HOME python ~/tensorflow-sample.py
Description of the command:
--gres=gpu:2g.20gb:1: The --gres parameter requests resources for the job. In this example it asks SLURM to schedule the job with one MIG GPU named 2g.20gb. Replace the GPU name with the one you want; the GPU names can be found on the Resources & Limits page. The last number indicates how many GPUs of that name are requested. If the requested resources are not available, the job will be left in the PENDING state.
--container-image=nvcr.io#nvidia/tensorflow:24.03-tf2-py3: The --container-image argument is provided by Pyxis. It tells Pyxis to call enroot to pull the specified TensorFlow image from the NGC Catalog. The value uses the format accepted by the enroot command; please refer to the enroot documentation for more information.
--container-workdir=$HOME: The --container-workdir argument is provided by Pyxis. It tells Enroot to set the working directory to the value stored in the environment variable HOME, which is usually the home directory of the logged-in user.
python ~/tensorflow-sample.py: This is the command to be executed within the TensorFlow image.
The content of tensorflow-sample.py is as below (extracted from the TensorFlow quick start guide):
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
predictions
tf.nn.softmax(predictions).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
loss=loss_fn,
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
The possible output after executing the command is:
pyxis: importing docker image: nvcr.io#nvidia/tensorflow:24.03-tf2-py3
pyxis: imported docker image: nvcr.io#nvidia/tensorflow:24.03-tf2-py3
2024-04-30 16:45:59.732472: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 16:45:59.732545: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 16:45:59.733648: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 16:45:59.739967: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-30 16:46:01.845354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18028 MB memory: -> device: 0, name: NVIDIA A800-SXM4-80GB MIG 2g.20gb, pci bus id: 0000:47:00.0, compute capability: 8.0
TensorFlow version: 2.15.0
Epoch 1/5
2024-04-30 16:46:03.493151: I external/local_xla/xla/service/service.cc:168] XLA service 0x154ae8657230 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-30 16:46:03.493198: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A800-SXM4-80GB MIG 2g.20gb, Compute Capability 8.0
2024-04-30 16:46:03.498174: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-04-30 16:46:03.534065: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:467] Loaded cuDNN version 90000
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1714466763.619721 4174084 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 4s 1ms/step - loss: 0.2928 - accuracy: 0.9138
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1418 - accuracy: 0.9583
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1073 - accuracy: 0.9670
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0887 - accuracy: 0.9729
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0750 - accuracy: 0.9769
313/313 - 1s - loss: 0.0720 - accuracy: 0.9782 - 501ms/epoch - 2ms/step
Please note that the container image layers are stored in the .cache/enroot directory under the user's home directory. Users need to remove this directory manually to free up disk space.
How can I check my pending and running jobs?¶
You can get the list of your pending and running jobs with the command below:
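Per the description that follows:

```
squeue -u <my COMP ID>
```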
Replace <my COMP ID> with your COMP ID. For example, if your COMP ID is mycompid, the command becomes squeue -u mycompid.
How can I cancel my running or pending job?¶
You can run the command below to cancel your job:
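Per the description that follows:

```
scancel <SLURM job ID>
```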
Replace <SLURM job ID> with the job ID of your job.
How can I get the status of my jobs?¶
You can list your jobs within a specific period, together with their status, using the command below:
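Per the example that follows:

```
sacct -S <start date> -E <end date>
```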
For example, to list your jobs from 6 Jun 2024 to 7 Jun 2024, you can execute the command sacct -S 2024-06-06 -E 2024-06-07.
Using Container (via SSH)¶
Prepare your own container image using Enroot¶
To prepare a container that works with SLURM and Pyxis, the tool enroot, developed by Nvidia, is recommended. It makes it easy to build a customized container for further use. Follow the steps below to create a container that works with SLURM and Pyxis.
- Connect to the HPC login node with your COMP ID and password using your favorite SSH client. If you are a COMP student, your COMP ID is usually your student ID. This example uses the ssh command and logs in as mycompid.
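Presumably:

```
ssh mycompid@hpclogin.comp.polyu.edu.hk
```

Possible output: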
The authenticity of host 'hpclogin.comp.polyu.edu.hk (xx.xx.xx.xx)' can't be established.
ED25519 key fingerprint is SHA256:WoKEtbRQ2Ci3YUdgQpuo2R6cferYppeyM6LjbW4Qhu8.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'hpclogina.comp.polyu.edu.hk' (ED25519) to the list of known hosts.
mycompid@hpclogina.comp.polyu.edu.hk's password:
Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-28-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

9 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm

Welcome to Base Command Manager 10.0

                        Based on Ubuntu Jammy Jellyfish 22.04
                        Cluster Manager ID: #00000

Use the following commands to adjust your environment:

'module avail'            - show available modules
'module add <module>'     - adds a module to your environment for this session
'module initadd <module>' - configure module to be loaded at every login
(Note: initadd is available only for Tcl modules)

-------------------------------------------------------------------------------

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Last login: Fri Apr 26 15:39:59 2024 from xx.xx.xx.xx
mycompid@hpclogina:~$
- Open a terminal to one of the HPC compute nodes.
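A minimal sketch, following the interactive-mode section above (the GPU name is an assumption; pick one that fits your needs):

```
module load slurm
srun --gres=gpu:2g.20gb:1 --pty bash
```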
- Make sure you are in your home directory.
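For example:

```
cd ~
```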
- Pull the image from the NGC Catalog. (Images from other sources should work, but images from the NGC Catalog are tested by Nvidia.) This example pulls a CUDA image based on Rocky Linux into the file nvidia-cuda.sqsh.
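The exact tag used originally was not preserved; based on the CUDA 12.4.1 / Rocky Linux 8 output below, a plausible invocation is:

```
enroot import --output nvidia-cuda.sqsh 'docker://nvcr.io#nvidia/cuda:12.4.1-devel-rockylinux8'
```

Possible output when finished: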
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 12 missing layers... 100% 12:0=0s a1b3e78ec0cae9530a969863cdcc4ec54767944b01ea2d16cdd68e552565ce1e
[INFO] Extracting image layers... 100% 11:0=0s 7ecefaa6bd84a24f90dbe7872f28a94e88520a07941d553579434034d9dca399
[INFO] Converting whiteouts... 100% 11:0=0s 7ecefaa6bd84a24f90dbe7872f28a94e88520a07941d553579434034d9dca399
[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 256 processors
Creating 4.0 filesystem on /home/mycompid/nvidia-cuda.sqsh, block size 131072.
[===========================================================-] 73063/73063 100%

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
	uncompressed data, uncompressed metadata, uncompressed fragments,
	uncompressed xattrs, uncompressed ids
	duplicates are not removed
Filesystem size 7870665.55 Kbytes (7686.20 Mbytes)
	99.99% of uncompressed filesystem size (7871305.51 Kbytes)
Inode table size 830843 bytes (811.37 Kbytes)
	100.00% of uncompressed inode table size (830843 bytes)
Directory table size 453291 bytes (442.67 Kbytes)
	100.00% of uncompressed directory table size (453291 bytes)
No duplicate files removed
Number of inodes 16313
Number of files 12971
Number of fragments 1218
Number of symbolic links 1524
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1818
Number of ids (unique uids + gids) 1
Number of uids 1
	root (0)
Number of gids 1
	root (0)
- Create the container from the pulled image.
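The container name below is an assumption; it matches the mycontainer.sqsh file exported later:

```
enroot create --name mycontainer nvidia-cuda.sqsh
```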
Possible output when finished:
[INFO] Extracting squashfs filesystem...

Parallel unsquashfs: Using 256 processors
16187 inodes (74587 blocks) to write
[===========================================================-] 74587/74587 100%
created 12971 files
created 1818 directories
created 1524 symlinks
created 0 devices
created 0 fifos
created 0 sockets
- Start the container.
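A sketch using the container name assumed above and the parameters described below:

```
enroot start --root --rw mycontainer
```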
Parameters:
--root: Work as root inside the container.
--rw: Make the container root filesystem writable.
Possible output when finished:
==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

bash-4.4#
- (Optional) Ensure you are working in the container. Since the base OS of the container differs from the host, you can simply print the OS name to verify.
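A minimal check, assuming the Rocky Linux based image above:

```
cat /etc/os-release
```

Possible output (first lines only; the exact release string may differ):

```
NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
```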
- (Optional) Check that the GPUs are visible in the container.
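Using the same listing command as earlier:

```
nvidia-smi -L
```

The output should resemble the nvidia-smi -L listing shown earlier, with the MIG GPU(s) you requested visible.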
- At this stage, you can install any packages you want and make other changes to the container. In this example, we install Python 3.11 and pip using the dnf (or yum) command.
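Based on the transaction shown below, the invocation was most likely:

```
dnf install python3.11 python3.11-pip
```

Possible output: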
Rocky Linux 8 - AppStream                       8.1 MB/s |  12 MB     00:01
Rocky Linux 8 - BaseOS                           12 MB/s | 8.2 MB     00:00
Rocky Linux 8 - Extras                           23 kB/s |  14 kB     00:00
cuda                                             17 MB/s | 3.3 MB     00:00
Dependencies resolved.
================================================================================
 Package                      Arch    Version           Repository        Size
================================================================================
Installing:
 python3.11                   x86_64  3.11.5-1.el8_9    appstream         29 k
 python3.11-pip               noarch  22.3.1-4.el8_9.1  appstream        2.9 M
Installing dependencies:
 mpdecimal                    x86_64  2.5.1-3.el8       appstream         92 k
 python3.11-libs              x86_64  3.11.5-1.el8_9    appstream         10 M
 python3.11-pip-wheel         noarch  22.3.1-4.el8_9.1  appstream        1.4 M
 python3.11-setuptools-wheel  noarch  65.5.1-2.el8      appstream        719 k
Installing weak dependencies:
 python3.11-setuptools        noarch  65.5.1-2.el8      appstream        2.0 M

Transaction Summary
================================================================================
Install  7 Packages

Total download size: 18 M
Installed size: 67 M
Downloading Packages:
(1/7): python3.11-3.11.5-1.el8_9.x86_64.rpm     701 kB/s |  29 kB     00:00
(2/7): mpdecimal-2.5.1-3.el8.x86_64.rpm         1.7 MB/s |  92 kB     00:00
(3/7): python3.11-pip-wheel-22.3.1-4.el8_9.1.no  38 MB/s | 1.4 MB     00:00
(4/7): python3.11-setuptools-65.5.1-2.el8.noarc  13 MB/s | 2.0 MB     00:00
(5/7): python3.11-setuptools-wheel-65.5.1-2.el8  57 MB/s | 719 kB     00:00
(6/7): python3.11-pip-22.3.1-4.el8_9.1.noarch.r  13 MB/s | 2.9 MB     00:00
(7/7): python3.11-libs-3.11.5-1.el8_9.x86_64.rp  12 MB/s |  10 MB     00:00
--------------------------------------------------------------------------------
Total                                            12 MB/s |  18 MB     00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : python3.11-setuptools-wheel-65.5.1-2.el8.noarch        1/7
  Installing       : python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch           2/7
  Installing       : mpdecimal-2.5.1-3.el8.x86_64                           3/7
  Installing       : python3.11-3.11.5-1.el8_9.x86_64                       4/7
  Running scriptlet: python3.11-3.11.5-1.el8_9.x86_64                       4/7
  Installing       : python3.11-libs-3.11.5-1.el8_9.x86_64                  5/7
  Installing       : python3.11-setuptools-65.5.1-2.el8.noarch              6/7
  Installing       : python3.11-pip-22.3.1-4.el8_9.1.noarch                 7/7
  Running scriptlet: python3.11-pip-22.3.1-4.el8_9.1.noarch                 7/7
  Verifying        : mpdecimal-2.5.1-3.el8.x86_64                           1/7
  Verifying        : python3.11-3.11.5-1.el8_9.x86_64                       2/7
  Verifying        : python3.11-libs-3.11.5-1.el8_9.x86_64                  3/7
  Verifying        : python3.11-pip-22.3.1-4.el8_9.1.noarch                 4/7
  Verifying        : python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch           5/7
  Verifying        : python3.11-setuptools-65.5.1-2.el8.noarch              6/7
  Verifying        : python3.11-setuptools-wheel-65.5.1-2.el8.noarch        7/7

Installed:
  mpdecimal-2.5.1-3.el8.x86_64
  python3.11-3.11.5-1.el8_9.x86_64
  python3.11-libs-3.11.5-1.el8_9.x86_64
  python3.11-pip-22.3.1-4.el8_9.1.noarch
  python3.11-pip-wheel-22.3.1-4.el8_9.1.noarch
  python3.11-setuptools-65.5.1-2.el8.noarch
  python3.11-setuptools-wheel-65.5.1-2.el8.noarch

Complete!
- Now verify that Python and pip are installed in the container.
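A minimal check (the version numbers follow the RPMs installed above):

```
python3.11 --version
pip3.11 --version
```

Possible output:

```
Python 3.11.5
pip 22.3.1 from /usr/lib/python3.11/site-packages/pip (python 3.11)
```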
You may further install Anaconda 3 / Miniconda or other Python packages. In this example, we will install PyTorch into the container.
- Update pip to the latest version.
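Judging from the output below, pip was upgraded through itself; a plausible invocation:

```
python3.11 -m pip install --upgrade pip
```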
Possible output:
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: pip in /usr/lib/python3.11/site-packages (22.3.1)
Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 127.7 MB/s eta 0:00:00
Installing collected packages: pip
Successfully installed pip-24.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
- Install PyTorch using pip.
The current version of PyTorch (2.3.0) requires cuDNN version 8 and the CUDA toolkit to be installed.
Install PyTorch:
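A plausible invocation (the exact package set used originally was not preserved):

```
python3.11 -m pip install torch torchvision
```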
Check the installed PyTorch version.
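A one-liner check:

```
python3.11 -c "import torch; print(torch.__version__)"
```

Possible output (per the PyTorch version noted above; a build suffix may be appended):

```
2.3.0
```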
Verify the PyTorch installation.
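For example, check that PyTorch can see the GPU:

```
python3.11 -c "import torch; print(torch.cuda.is_available())"
```

Possible output:

```
True
```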
- Exit the container when finished (execute exit or press Ctrl+D).
- Be sure to export the container before leaving the SLURM session. Otherwise, your work will be removed once the SLURM session exits.
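A sketch using the names assumed above (the output below shows the file being written to mycontainer.sqsh):

```
enroot export --output mycontainer.sqsh mycontainer
```

Possible output when finished: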
[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 256 processors
Creating 4.0 filesystem on /home/mycompid/mycontainer.sqsh, block size 131072.
[=========================================================|] 160588/160588 100%

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
	uncompressed data, uncompressed metadata, uncompressed fragments,
	uncompressed xattrs, uncompressed ids
	duplicates are not removed
Filesystem size 15547518.67 Kbytes (15183.12 Mbytes)
	100.00% of uncompressed filesystem size (15548159.75 Kbytes)
Inode table size 2220814 bytes (2168.76 Kbytes)
	100.00% of uncompressed inode table size (2220814 bytes)
Directory table size 1470358 bytes (1435.90 Kbytes)
	100.00% of uncompressed directory table size (1470358 bytes)
Xattr table size 2007 bytes (1.96 Kbytes)
	102.40% of uncompressed xattr table size (1960 bytes)
No duplicate files removed
Number of inodes 50683
Number of files 43503
Number of fragments 3882
Number of symbolic links 2250
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 4930
Number of ids (unique uids + gids) 1
Number of uids 1
	root (0)
Number of gids 1
	root (0)
Verify the container has been exported.
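A minimal check:

```
ls -lh ~/mycontainer.sqsh
```

The listing should show mycontainer.sqsh at roughly the filesystem size reported above (about 15 GB).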
- Exit the SLURM session (execute exit or press Ctrl+D).
Execute commands in a container image prepared by Enroot¶
Below is an example that runs a Python script in the previously created container image via SLURM. SLURM must be used to expose GPU resources to the process.
srun --gres=gpu:2g.20gb:1 --container-image=./mycontainer.sqsh --container-workdir=$HOME python3 ~/pytorch-sample.py
Description of the command:
--gres=gpu:2g.20gb:1: This parameter requests SLURM to schedule the job with one MIG GPU named 2g.20gb. If the requested resources are not available, the job will be left in the PENDING state.
--container-image=./mycontainer.sqsh: Tells enroot to load the image from the local mycontainer.sqsh file. Please note that the ./ prefix is necessary when referring to a file in the current working directory, because enroot only looks for the image locally when it sees the slash character (/).
--container-workdir=$HOME: The --container-workdir argument is provided by Pyxis. It tells Enroot to set the working directory to the value stored in the environment variable HOME, which is usually the home directory of the logged-in user.
python3 ~/pytorch-sample.py: This is the command to be executed within the image.
The content of pytorch-sample.py is as below (extracted from the PyTorch quick start guide):
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
root="data",
train=True,
download=True,
transform=ToTensor(),
)
# Download test data from open datasets.
test_data = datasets.FashionMNIST(
root="data",
train=False,
download=True,
transform=ToTensor(),
)
batch_size = 64
# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)
for X, y in test_dataloader:
print(f"Shape of X [N, C, H, W]: {X.shape}")
print(f"Shape of y: {y.shape} {y.dtype}")
break
# Get cpu, gpu or mps device for training.
device = (
"cuda"
if torch.cuda.is_available()
else "mps"
if torch.backends.mps.is_available()
else "cpu"
)
print(f"Using {device} device")
# Define model
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.linear_relu_stack = nn.Sequential(
nn.Linear(28*28, 512),
nn.ReLU(),
nn.Linear(512, 512),
nn.ReLU(),
nn.Linear(512, 10)
)
def forward(self, x):
x = self.flatten(x)
logits = self.linear_relu_stack(x)
return logits
model = NeuralNetwork().to(device)
print(model)
print("Training Model")
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
def train(dataloader, model, loss_fn, optimizer):
size = len(dataloader.dataset)
model.train()
for batch, (X, y) in enumerate(dataloader):
X, y = X.to(device), y.to(device)
# Compute prediction error
pred = model(X)
loss = loss_fn(pred, y)
# Backpropagation
loss.backward()
optimizer.step()
optimizer.zero_grad()
if batch % 100 == 0:
loss, current = loss.item(), (batch + 1) * len(X)
print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
def test(dataloader, model, loss_fn):
size = len(dataloader.dataset)
num_batches = len(dataloader)
model.eval()
test_loss, correct = 0, 0
with torch.no_grad():
for X, y in dataloader:
X, y = X.to(device), y.to(device)
pred = model(X)
test_loss += loss_fn(pred, y).item()
correct += (pred.argmax(1) == y).type(torch.float).sum().item()
test_loss /= num_batches
correct /= size
print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
epochs = 5
for t in range(epochs):
print(f"Epoch {t+1}\n-------------------------------")
train(train_dataloader, model, loss_fn, optimizer)
test(test_dataloader, model, loss_fn)
print("Done!")
classes = [
"T-shirt/top",
"Trouser",
"Pullover",
"Dress",
"Coat",
"Sandal",
"Shirt",
"Sneaker",
"Bag",
"Ankle boot",
]
model.eval()
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
x = x.to(device)
pred = model(x)
predicted, actual = classes[pred[0].argmax(0)], classes[y]
print(f'Predicted: "{predicted}", Actual: "{actual}"')
The possible output after executing the command is:
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
Using cuda device
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(linear_relu_stack): Sequential(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=10, bias=True)
)
)
Training Model
Epoch 1
-------------------------------
loss: 2.311987 [ 64/60000]
loss: 2.296749 [ 6464/60000]
loss: 2.278760 [12864/60000]
loss: 2.267926 [19264/60000]
loss: 2.258412 [25664/60000]
loss: 2.225776 [32064/60000]
loss: 2.235035 [38464/60000]
loss: 2.198730 [44864/60000]
loss: 2.192165 [51264/60000]
loss: 2.163504 [57664/60000]
Test Error:
Accuracy: 45.5%, Avg loss: 2.160435
Epoch 2
-------------------------------
loss: 2.174705 [ 64/60000]
loss: 2.159503 [ 6464/60000]
loss: 2.106915 [12864/60000]
loss: 2.117778 [19264/60000]
loss: 2.075339 [25664/60000]
loss: 2.014432 [32064/60000]
loss: 2.040038 [38464/60000]
loss: 1.959843 [44864/60000]
loss: 1.951941 [51264/60000]
loss: 1.890565 [57664/60000]
Test Error:
Accuracy: 59.4%, Avg loss: 1.887667
Epoch 3
-------------------------------
loss: 1.928848 [ 64/60000]
loss: 1.885536 [ 6464/60000]
loss: 1.776609 [12864/60000]
loss: 1.809799 [19264/60000]
loss: 1.709673 [25664/60000]
loss: 1.662814 [32064/60000]
loss: 1.678832 [38464/60000]
loss: 1.583767 [44864/60000]
loss: 1.592498 [51264/60000]
loss: 1.495840 [57664/60000]
Test Error:
Accuracy: 61.4%, Avg loss: 1.516900
Epoch 4
-------------------------------
loss: 1.592645 [ 64/60000]
loss: 1.541337 [ 6464/60000]
loss: 1.404812 [12864/60000]
loss: 1.473237 [19264/60000]
loss: 1.354290 [25664/60000]
loss: 1.348429 [32064/60000]
loss: 1.361446 [38464/60000]
loss: 1.291682 [44864/60000]
loss: 1.316547 [51264/60000]
loss: 1.219320 [57664/60000]
Test Error:
Accuracy: 63.4%, Avg loss: 1.250891
Epoch 5
-------------------------------
loss: 1.336257 [ 64/60000]
loss: 1.299178 [ 6464/60000]
loss: 1.149520 [12864/60000]
loss: 1.252716 [19264/60000]
loss: 1.123262 [25664/60000]
loss: 1.144166 [32064/60000]
loss: 1.165725 [38464/60000]
loss: 1.110545 [44864/60000]
loss: 1.144198 [51264/60000]
loss: 1.057242 [57664/60000]
Test Error:
Accuracy: 64.7%, Avg loss: 1.084700
Done!
Predicted: "Ankle boot", Actual: "Ankle boot"
Work with pre-pulled Apptainer (Singularity) images¶
Some images are pre-pulled, converted to Apptainer images, and stored under /container-image with the file extension .sif.
You can list the Apptainer images from the login node using the command below:
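For example:

```
ls /container-image/*.sif
```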
You can use the apptainer command to work with an Apptainer image.
- First open a terminal to a compute node. (More info)
- Load the Apptainer module:
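Following the module convention used elsewhere on this page (the exact module name is an assumption):

```
module load apptainer
```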
- Open a terminal in the Apptainer image. In this example, we are working with /container-image/tensorflow:23.11-tf2-py3.sif. The --nv parameter enables NVIDIA GPU support in Apptainer.
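A sketch using apptainer shell:

```
apptainer shell --nv /container-image/tensorflow:23.11-tf2-py3.sif
```

Once the image is loaded, you will reach a prompt like Apptainer>.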
- You are now in the terminal of the image. Your home directory is mapped automatically. You can execute your program now. For example:
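Reusing the sample script from the earlier TensorFlow example (run at the Apptainer> prompt):

```
python ~/tensorflow-sample.py
```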
- To exit the terminal of the image, either execute exit or press Ctrl+D.
Using Anaconda 3 (via SSH)¶
Work with pre-installed Anaconda 3¶
Anaconda 3 is pre-installed on the server. TensorFlow, PyTorch and SciPy (from the conda-forge channel) are installed in the conda-forge environment.
To work with Anaconda 3:
- Be sure you are in the terminal of a compute node. (More info)
- Load the Anaconda 3 module:
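As used elsewhere on this page:

```
module load anaconda3
```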
The conda-forge environment will be activated by default. You can check the conda environment with the conda info command.
- You can check which packages (and which versions) are installed with this command:
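Most likely the standard conda listing command:

```
conda list
```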
Create your own conda environment and install packages to your home directory¶
The default installed packages may not meet your needs (e.g. you may want to use another Python version or install additional packages). You can create your own conda environment to install a different Python version and any other packages.
- Check that the conda command is available in your session:
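A minimal check, for example:

```
conda --version
```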
- If the conda command is not found in your PATH, be sure to load Anaconda 3 in your session:
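```
module load anaconda3
```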
- Create your own environment. We name it myenv in this example.
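Based on the parameter descriptions below, the command is:

```
conda create -n myenv -c conda-forge python=3.10
```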
This will create a conda environment named myenv with Python 3.10 installed from the conda-forge channel. Installing packages from the conda-forge channel is recommended because it usually has more up-to-date packages. The environment will be located at /home/mycompid/.conda/envs/myenv.
The parameters:
-n myenv: The name of the target environment.
-c conda-forge: Install packages from the specified channel.
python=3.10: Packages to be installed into the new environment.
- Activate your environment:
If you have not executed conda init before, you are required to execute it first. This command modifies your login script(s) so that the Anaconda 3 base environment is loaded every time you log in to the HPC servers. To reverse this change, you can execute conda init --reverse.
Each time you log in to the login node, please source your login script to set up your Anaconda 3 environment, then activate your own environment (named myenv here), as sketched below. You can check the active environment with the conda info command.
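A minimal sketch of these steps, assuming a bash login shell:

```
conda init             # first time only; modifies your login script(s)
source ~/.bashrc       # source your login script after logging in
conda activate myenv   # activate your own environment
```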
- To install packages using conda from the conda-forge channel, you can execute a command like conda install -c conda-forge <package name>. For example, the command to install PyTorch and TensorFlow with CUDA version 12.2 into your own environment is:
conda install -c conda-forge pytorch=*=cuda120* tensorflow=*=cuda120* torchvision=*=cuda120* cuda-version=12.2
Once finished, you can verify the installation with the conda list pytorch command. You can uninstall the package with conda uninstall pytorch.
- To install packages using pip, you can execute a command like pip install <package name>. If you have an active conda environment, pip will install packages into it, so you can also verify the installation with the conda list openai command. You can uninstall the package with pip uninstall openai; this only uninstalls the specified package. Dependencies are not uninstalled automatically, and you will have to remove them manually if needed. For example, the command to install openai 1.23 into your own environment is shown below.
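The exact command was not preserved in this copy; a plausible form is:

```
pip install "openai==1.23.*"
```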
- (Optional) If you want to use the environment in JupyterHub, please install cm-jupyter-eg-kernel-wlm using pip in your environment:
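The package name comes from this page; the invocation is presumably the plain pip form:

```
pip install cm-jupyter-eg-kernel-wlm
```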
- To deactivate your own environment, you can execute this command:
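Most likely the standard conda command:

```
conda deactivate
```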
You will go back to the previous environment. You can check your current environment with the conda info command. You can re-activate the environment by executing conda activate myenv again.
Additional note: if you want to remove your own environment, you can execute the command below (for example, removing the environment named myenv):
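A standard form (the exact flags used originally were not preserved):

```
conda env remove -n myenv
```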
Using JupyterHub¶
What is a kernel and what is a kernel template?
A kernel runs your code on the server in a Jupyter Notebook. A kernel template helps users create a kernel to be used in a Jupyter Notebook.
Access and login JupyterHub¶
- Access JupyterHub via this link: https://hpclogin.comp.polyu.edu.hk:8000/.
- Enter your COMP ID and password into the appropriate text boxes, then click the "Sign in" button to log in.
For details about JupyterHub interface, please refer to JupyterHub official documentation.
Create a Jupyter Notebook that works with SLURM¶
- Create a kernel from a kernel template.
Note: if you have already created a kernel that works with SLURM, you do not need to re-create another one as long as the resource requirements are the same.
- Click the icon at the left side bar.
- Click the icon next to the kernel template. Here we will use "Python 3.9 via SLURM" as an example.
- Select the GPU resource to be used by this kernel. Please be reminded that the resource cannot be changed after kernel creation; if you want to use another resource, you will need to create another kernel. The MIG profile has a limitation that only one GPU resource can be used by a process.
- Change the display name to one that lets you identify the kernel.
- You can see the newly created kernel, as shown in the image.
- Once you have created a kernel, new icons linked to that kernel are shown in the Launcher. Go to the Launcher and click the kernel icon under "Notebook".
- The Jupyter Notebook is created. Please verify the kernel status at the top right corner. If it is not the right kernel, or the kernel status icon is not a hollow circle, please select the proper kernel and try again; otherwise, your code cannot run on a compute node.
If everything is fine, you can start coding in the Jupyter notebook.
- You can, at any time, change the notebook kernel by clicking the active kernel name at the top-right corner of your notebook.
Note: if the message "Error from Gateway: [Timeout during request] Exception while attempting to connect to Gateway server url 'https://localhost:8890'. Ensure gateway url is valid and the Gateway instance is running." appears when you select the kernel, it is usually because the resources you requested cannot be allocated. When you switch to a kernel via SLURM in a Jupyter Notebook, a new job is scheduled in SLURM. The SLURM job will not be cancelled automatically even if the kernel times out; in this case, you will need to cancel it manually.
To cancel the kernel job:
1. Open a terminal.
2. Execute module load slurm.
3. Execute squeue -u <my COMP ID> (replace <my COMP ID> with your COMP ID) to get a list of active jobs. Locate the JOBID of the pending job ("PD" under the "ST" column and "(QOSMaxGRESPerJob)" under the "NODELIST(REASON)" column).
4. Execute scancel <JOB ID> (replace <JOB ID> with the value you obtained in step 3).
5. The job should be cancelled. You can verify this by executing squeue -u <my COMP ID> again.
How to use Anaconda3 in JupyterHub?¶
- (First time only) Connect to the HPC server via SSH (Reference) or open a terminal in JupyterHub (Reference).
- (First time only) Load the Anaconda 3 module with the command module load anaconda3, or install your own Anaconda 3 environment.
- (First time only) Execute conda init.
- Log in to JupyterHub (Reference).
- Create a kernel from the kernel template named "Python Conda kernel on SLURM" with the proper resource. (Reference)
- In the "Environment to use" field, select the anaconda environment to be used in the kernel.
- Create a new Jupyter notebook using the kernel created from "Python Conda kernel on SLURM" (Reference), or switch the kernel in an existing Jupyter notebook.
- Please verify the kernel status at the top right corner. If it is not the right kernel, or the kernel status icon is not a hollow circle, please select the proper kernel and try again; otherwise, your code cannot run on a compute node.
If everything is fine, you can start coding in the Jupyter notebook.
Note
The module cm-jupyter-eg-kernel-wlm must be installed in your environment. If it is not installed, please install it using pip in your environment:
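As in the conda section above (the package name is taken from this page; the plain pip invocation is an assumption):

```
pip install cm-jupyter-eg-kernel-wlm
```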
Open a terminal in JupyterHub¶
- Create a new "Launcher".
- Click "Terminal" icon.
- A terminal opens, in which you can execute commands, for example to manage conda environments.
What kernels and kernel templates are available?¶
The following kernels and kernel templates are available on the server.

| Name | Kernel or Kernel Template | Description |
|---|---|---|
| Python 3 | Kernel | Your code will run on the login node. This is not recommended. |
| Python 3.9 via SLURM | Kernel Template | Your code will run on a compute node if the required resources are available. The system Python version 3.9 will be used. |
| Python Conda kernel on SLURM | Kernel Template | Your code will run on a compute node if the required resources are available. The anaconda environment you selected during kernel creation will be used. Note: if this item is not visible in your interface, please execute conda init in a terminal and reload JupyterHub. |