FAQ

I get a connection timeout when connecting to the login node by SSH.

The login node hpclogin.comp.polyu.edu.hk can be accessed directly from COMP lab rooms and COMP offices over a wired connection. From anywhere else, the PolyU Staff VPN or Research VPN is required.

I want to upload a file to my home directory.

Use any SFTP/SCP program (for example, WinSCP on Windows) to connect to hpclogin.comp.polyu.edu.hk. Your home directory is accessible after login.

How much space remains in my home directory?

Run the command df -h | grep home to check the home directory usage. The output looks like this:

myloginidg@hpclogina:~$ df -h |grep home
hpcstore.comp.polyu.edu.hk:/hpchome           50G  6.7G   44G  14% /home
myloginidg@hpclogina:~$

In this example, 44G remains available.
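For use in scripts, the available space can be pulled out of the df output directly. The following is a minimal sketch, assuming the home filesystem is mounted at /home as in the example; the function name home_avail is illustrative and not part of the platform.

```shell
# Print the "Avail" column for the filesystem mounted at /home.
# Reads `df -h` output on stdin; assumes the mount point is the last field.
home_avail() {
    awk '$NF == "/home" { print $(NF-2) }'
}

# Demonstration using the sample line from the example output.
sample_df='hpcstore.comp.polyu.edu.hk:/hpchome           50G  6.7G   44G  14% /home'
printf '%s\n' "$sample_df" | home_avail
```

On the login node, df -h | home_avail would report the figure for the live filesystem.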

I cannot log in to the login node using my account and password.

Please make sure you are logging in to the login node with your COMP ID and password. The COMP ID, which is separate from your PolyU NetID, is the account used to log in to the COMP Intranet. In general, it must be activated before use. If you have not activated it yet, please visit https://acct.comp.polyu.edu.hk to activate it. If you forgot your COMP ID password, you can also reset it there.

I requested two GPUs but my program can only use one.

This is a limitation of the MIG profile. Only one GPU, by default the first one, is visible to a process. To let a process use a different GPU, set the CUDA_VISIBLE_DEVICES environment variable to the UUID of that GPU before starting the process. You can list the UUIDs of the GPUs with the command nvidia-smi -L. Example output:

GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-27c84131-089e-de40-7fd8-3b27fa146cef)
MIG 2g.20gb     Device  0: (UUID: MIG-caa9d092-e4e0-5e73-b65a-49c8219733c8)
MIG 2g.20gb     Device  1: (UUID: MIG-83e8d88b-2c83-5339-89ea-f44b15642f37)

For example, to use MIG 2g.20gb Device 1 in a program named my_script.py, run the following command in a compute node shell:

CUDA_VISIBLE_DEVICES=MIG-83e8d88b-2c83-5339-89ea-f44b15642f37 python3 ./my_script.py

To use two GPUs at the same time, you currently need to run one process per GPU in parallel, for example:

CUDA_VISIBLE_DEVICES=MIG-caa9d092-e4e0-5e73-b65a-49c8219733c8 python3 ./my_script.py &
CUDA_VISIBLE_DEVICES=MIG-83e8d88b-2c83-5339-89ea-f44b15642f37 python3 ./my_script.py &
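
The one-process-per-device launch can be scripted so the UUIDs do not have to be copied by hand. The following is a sketch only: the helper names mig_uuids and run_on_each_mig are hypothetical, and the parsing assumes nvidia-smi -L prints MIG devices in the "(UUID: MIG-...)" form shown in the example output.

```shell
# Extract MIG device UUIDs from `nvidia-smi -L` output read on stdin.
# Lines for whole GPUs (UUID: GPU-...) are ignored.
mig_uuids() {
    sed -n 's/.*(UUID: \(MIG-[0-9a-f-]*\)).*/\1/p'
}

# Start one copy of the given command per UUID read on stdin,
# each with CUDA_VISIBLE_DEVICES set, then wait for all of them.
run_on_each_mig() {
    while read -r uuid; do
        CUDA_VISIBLE_DEVICES="$uuid" "$@" </dev/null &
    done
    wait
}

# Demonstration on the sample listing from the example output.
sample_listing='GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-27c84131-089e-de40-7fd8-3b27fa146cef)
MIG 2g.20gb     Device  0: (UUID: MIG-caa9d092-e4e0-5e73-b65a-49c8219733c8)
MIG 2g.20gb     Device  1: (UUID: MIG-83e8d88b-2c83-5339-89ea-f44b15642f37)'
printf '%s\n' "$sample_listing" | mig_uuids
```

On a compute node, the equivalent of the two background commands would be nvidia-smi -L | mig_uuids | run_on_each_mig python3 ./my_script.py. Each copy of the program still sees only one device, so the program itself has to split the work between the copies.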

Why does my job stay PENDING and never run?

Your job stays in PENDING status because SLURM cannot allocate the requested resources to run your program. Either all resources are currently in use, or you have requested more resources than you are allowed. Please make sure the requested resources do not exceed the allowed limit. For example, if you are allowed at most 4 CPUs, the command below will never start because the job requires 6 CPUs:

srun -n 6 -N 2 bash ./myscript.sh
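
SLURM records a reason for each pending job, and squeue can print it with the %r format field. The sketch below filters that output down to pending jobs; the function name pending_reasons and the sample job list are hypothetical and for illustration only.

```shell
# Read lines of "<job id> <state> <reason>", the format produced by
#   squeue -u <my COMP ID> -h -o '%i %T %r'
# and print the ID and reason of each job that is still PENDING.
pending_reasons() {
    awk '$2 == "PENDING" { id = $1; $1 = $2 = ""; sub(/^ +/, ""); print id ": " $0 }'
}

# Hypothetical sample output, not taken from the real cluster.
sample_jobs='1001 RUNNING None
1002 PENDING Resources
1003 PENDING QOSMaxCpuPerUserLimit'
printf '%s\n' "$sample_jobs" | pending_reasons
```

A reason such as Resources or Priority usually means the job is waiting its turn, while a reason beginning with QOS typically points to a request that exceeds a configured limit.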

I get "command not found" when executing a command.

Be sure you have loaded the appropriate module before executing the command. The table below lists the modules for some common commands.

Command      Module to load
srun         module load slurm
sbatch       module load slurm
conda        module load anaconda3
python       module load anaconda3
apptainer    module load apptainer
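
The table can also be expressed as a small lookup for use in job scripts. This is a sketch only: module_for and suggest_module are hypothetical helper names, and the mapping covers just the commands listed in the table.

```shell
# Map a command name to the module that provides it, per the table.
# Prints nothing and returns 1 for commands the table does not cover.
module_for() {
    case "$1" in
        srun|sbatch)   echo slurm ;;
        conda|python)  echo anaconda3 ;;
        apptainer)     echo apptainer ;;
        *)             return 1 ;;
    esac
}

# Print the `module load` line to run if a command is not on PATH yet.
suggest_module() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 is already available"
    elif mod=$(module_for "$1"); then
        echo "module load $mod"
    else
        echo "no known module for $1" >&2
        return 1
    fi
}

module_for sbatch
```

For example, suggest_module apptainer prints the module load line to run when apptainer is not yet on the PATH.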

What modules are installed on the compute node?

Run the command module avail to list the modules installed on the compute node. You can then load a module with the command module load <module name>, where <module name> is one of the names shown by module avail.

What Python packages are installed on the compute node?

  1. Load Anaconda 3 if it is not already loaded:
    module load anaconda3

  2. Execute:
    conda list
    
How can I check my pending and running jobs?

You can list your pending and running jobs with the following command:

squeue -u <my COMP ID>

Replace <my COMP ID> with your COMP ID. For example, if your COMP ID is mycompid, the command becomes squeue -u mycompid.

How can I cancel my running or pending job?

You can cancel your job with the following command:

scancel <SLURM job ID>

Replace <SLURM job ID> with the job ID of your job.

Can I use Jupyter Notebook?

The HPC platform comes with JupyterHub. Please refer to the related HOWTOs section for information on using Jupyter Notebook on the HPC platform.

I get "Error Starting Kernel" when using JupyterHub. What does it mean?

You may sometimes see an "Error Starting Kernel" message when using JupyterHub.
[Screenshots: Error Starting Kernel, examples 1 and 2]

In most cases, this error means that your requested resources cannot be allocated. The most likely reason is that you requested more resources than you are allowed. Please request fewer resources and try again.

You can run the command squeue -u <my COMP ID> (replace <my COMP ID> with your COMP ID) to check your currently submitted jobs.

You can cancel jobs, especially those shown with QOSMaxGRESPerJob in the squeue output, with the command scancel <SLURM job ID>. Replace <SLURM job ID> with the job ID reported by the squeue command above.
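
Picking out the blocked jobs can be scripted from the squeue output. The following is a minimal sketch, assuming squeue is asked for job ID and reason with the %i and %r format fields; the function name jobs_with_reason and the sample job list are hypothetical.

```shell
# Read lines of "<job id> <reason>", the format produced by
#   squeue -u <my COMP ID> -h -o '%i %r'
# and print the IDs of the jobs whose reason equals the first argument.
jobs_with_reason() {
    awk -v reason="$1" '$2 == reason { print $1 }'
}

# Hypothetical sample output, not taken from the real cluster.
sample_queue='2001 None
2002 QOSMaxGRESPerJob
2003 QOSMaxGRESPerJob'
printf '%s\n' "$sample_queue" | jobs_with_reason QOSMaxGRESPerJob
```

Each printed ID can then be passed to scancel. Reviewing the list before cancelling is advisable, since a cancelled job cannot be resumed.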

Please check Resources & Limits for the current resource limits.