GPUs on the Grid

Machine Information

We currently have more than 150 GPU slots on the CLSP grid, as shown in the following table (adapted from the JHU MT Wiki).

Machines       GPUs per machine   Total GPUs   Model          Memory per GPU
b01-05,07-10   4                  36           Tesla K10.G2    3526 MiB
b06            2                  2            Tesla K20m      4742 MiB
b11-18,20      4                  32           Tesla K80      11439 MiB
b19            2                  2            Tesla M40      22939 MiB
c01-c11        4                  44           GTX 1080 Ti    11172 MiB

To check the current number of available GPUs, as well as which users are using how many GPUs, you can use this script: ~keli1/scripts/gpu_check.sh.
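
For reference, a minimal sketch of two ways to get an overview (the second assumes the gpu resource is exposed to qstat as a consumable, which the -l 'gpu=k' flag used below suggests):

bash ~keli1/scripts/gpu_check.sh    # summary of free GPUs and per-user usage
qstat -F gpu                        # ask the scheduler for the remaining gpu consumable per queue instance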

Configuring CUDA and CuDNN

The CLSP machines usually have the most recent version of CUDA installed. To link against it, add the following to your environment (for example, in your ~/.bashrc):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

As of 2018, this path points to CUDA 9.1. If you would like to use CUDA 7.5 instead, add /opt/NVIDIA/cuda-7.5 to LD_LIBRARY_PATH.
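
For example, a minimal ~/.bashrc sketch that pins CUDA 7.5; the lib64 and bin subdirectories and the CUDA_HOME/PATH lines are assumptions based on the usual CUDA install layout, so adjust them to whatever is actually present under /opt/NVIDIA/cuda-7.5:

export CUDA_HOME=/opt/NVIDIA/cuda-7.5                # assumed install root
export PATH=$CUDA_HOME/bin:$PATH                     # so nvcc from this toolkit is found first
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64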

Obtaining CuDNN

Some frameworks, such as TensorFlow, require a recent version of CuDNN. You can obtain it for free from NVIDIA after answering a short survey.

It is also possible that someone already has a copy of a recent libcudnn.so.# somewhere on the grid, in which case you can simply copy it from them.

Linking CuDNN

After downloading CuDNN onto the grid, you need to add it to your LD_LIBRARY_PATH. Untar the archive and append the resulting cuda/lib64 directory to LD_LIBRARY_PATH in your ~/.bashrc:

tar -xvf cudnn-version_number-filename
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:current_path/cuda/lib64

where cudnn-version_number-filename is the tarball you downloaded and current_path is the directory you untarred it into. If you later move the files, make sure you update the path accordingly.
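
As a concrete (hypothetical) sketch, assuming the tarball was downloaded to ~/tools and is named cudnn-9.1-linux-x64-v7.tgz:

cd ~/tools                                  # placeholder download location
tar -xvf cudnn-9.1-linux-x64-v7.tgz         # hypothetical file name; unpacks into ./cuda
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/tools/cuda/lib64' >> ~/.bashrc
source ~/.bashrc                            # pick up the new path in the current shell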

Running GPU Jobs

First and most important rule to keep in mind:

Always request GPUs and only use free GPUs.

Unlike CPUs, GPUs are not divisible and cannot be shared across several processes: every GPU process on the grid has to exclusively occupy one or more GPUs, and that is what makes GPU resources scarce. If you submit GPU jobs without requesting GPUs, or you do not run your job on the free GPU you are supposed to use, you will either trigger the GPU killer (and the admins will be notified) or, in worse circumstances, you will kill other people's jobs and waste a huge amount of computation time.

The information on this page assumes that you already know how to run CPU jobs, or at least have read this introduction. If you haven't done so, please do that first.

Requesting GPUs

To request k GPUs on a machine, use either qsub or qlogin with the flag -l 'gpu=k'. If you qlogin, also specify a time limit such as -l h_rt=8:00:00 so that you don't inadvertently hoard GPU slots inside a screen session (and if you want to wait for a free GPU, add -now no; otherwise qlogin exits immediately if no GPUs are currently available). It is also suggested to submit jobs to the dedicated GPU queue, specified by -q g.q (note that you should not submit to gpu.q unless you have been told to do so). You can combine the GPU flag with other resource flags, especially if you plan to use more than 1GB of memory. For example:

qsub -l 'gpu=1,mem_free=12g,ram_free=12g' -q g.q xxx.sh

would request one GPU and 12GB of memory.
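
Similarly, an interactive session that requests one GPU, sets a time limit, and waits for a slot as described above would look like:

qlogin -l 'gpu=1,h_rt=8:00:00' -q g.q -now no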

If you only want the K80s or GTX 1080 Tis,

qsub -l 'hostname=b1[12345678]*|c*,gpu=1' -q g.q xxx.sh

will guarantee one of these larger-memory GPUs for you.

It can be difficult to estimate how much memory a GPU job will take. To give you some idea, most NMT toolkits use between 6GB and 16GB. You do not need to reserve CUDA (GPU) memory explicitly.

Locating Free GPUs

Requesting GPUs only guarantees that a certain number of GPUs on a certain machine are free for you, but for most (if not all) deep learning frameworks it is up to you to choose which GPU to use, so it is also up to you to figure out which one is free. Again, if you do not run your job on a free GPU, you will either trigger the GPU killer (and the admins will be notified) or, in worse circumstances, you will kill other people's jobs and waste a huge amount of computation time.

To see which GPU is free on a specific machine, ssh to that machine and run nvidia-smi. Here is an example of the output:

shuoyangd@b18:~$ nvidia-smi
Fri Sep  1 13:41:48 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:06:00.0     Off |                    0 |
| N/A   69C    P0   120W / 149W |   2210MiB / 11439MiB |     81%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:07:00.0     Off |                    0 |
| N/A   40C    P0    79W / 149W |    110MiB / 11439MiB |     31%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    26W / 149W |      2MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:86:00.0     Off |                    0 |
| N/A   21C    P8    29W / 149W |      2MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     51727    C   python2                                       2208MiB |
|    1     70094    C   python                                         114MiB |
+-----------------------------------------------------------------------------+
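
If you only want a compact summary, nvidia-smi can also print selected fields; a minimal sketch (these --query-gpu field names are standard, but check nvidia-smi --help-query-gpu on the machine if they differ):

nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv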

However, in most cases you just want the index of a free GPU. Gaurav Kumar wrote a script for this task: on any machine with a GPU, simply run free-gpu and it will give you the index of a free GPU. For example, if you see:

shuoyangd@b18:~$ free-gpu
3

that means the GPU with index 3 is free.

If you want more than one GPU, simply run:

shuoyangd@b18:~$ free-gpu -n 2
2,3

which means that GPUs 2 and 3 are free.

To the best of our knowledge, there are two ways of using these indices (a complete job-script sketch follows this list):

  • For some programs, you have an option to pass the GPU index to run on. In these cases the shell script you submit should look like:
devices=`free-gpu`
python some_gpu_jobs.py -gpuid $devices
  • In other cases, the job uses GPU 0 by default; you then need to set the environment variable CUDA_VISIBLE_DEVICES, as in the following example:
CUDA_VISIBLE_DEVICES=`free-gpu` python some_gpu_jobs.py
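
Putting the pieces together, here is a minimal sketch of a complete GPU job script; train.sh and some_gpu_job.py are placeholder names:

#!/bin/bash
# train.sh -- placeholder job script; submit it with, e.g.:
#   qsub -l 'gpu=1,mem_free=12g,ram_free=12g' -q g.q train.sh
device=`free-gpu`                                     # index of a currently free GPU on this machine
CUDA_VISIBLE_DEVICES=$device python some_gpu_job.py   # placeholder for your actual program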

Debugging GPU jobs

To interactively debug GPU jobs, you can qlogin to a machine with a GPU. Keep in mind that you should log out of these machines once you are done, so that you don't occupy a GPU slot forever. Also keep in mind that you should re-check which GPUs are currently free every time, since jobs come and go.

Also, at times when the GPU queue is full, it might be hard to get a GPU to debug on instantly, and by default qlogin will quit after several seconds of waiting. If you choose to wait instead, run

qlogin -l 'gpu=1' -now no

and the command will hang until you get a GPU. The command-line options used for submitting jobs can also be passed here (e.g. for selecting a host).
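
For example, to wait for an interactive slot on one of the GTX 1080 Ti machines specifically:

qlogin -l 'hostname=c*,gpu=1' -q g.q -now no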

Looking Up Your Quota

At times you may find that there are still free GPUs hanging around but your jobs are stuck in the queue. The most probable cause is that the CLSP grid strictly enforces a per-user GPU quota (which depends on funding attribution), and you have likely run out of yours. To look up your quota, run qconf -srqs | grep [username].
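
For example (using $USER in place of your username; qstat -u is the standard way to list your own jobs):

qconf -srqs | grep $USER          # show the resource quota rule(s) that mention you
qstat -u $USER                    # list your currently queued and running jobs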

If you need to run more concurrent GPU jobs, talk to the grid administrator.