(Old) Introduction to the CLSP Grid

From CLSP Wiki
Revision as of 09:19, 1 September 2017 by Cmay (Talk | contribs)


Grid Intro

We assume you have a username and password. If not, and you need one, you may obtain one from clsphelp@clsp.jhu.edu.

The default shell we give to new users is bash. Our machines run Debian Linux. The 'a' machines (a01 through a18) each have many CPUs and a lot of memory (typically around 100G). The 'b' machines (b01 through b19) additionally have GPUs.

To access our cluster you should ssh to login.clsp.jhu.edu or login2.clsp.jhu.edu.

When you first log in you should change your password from the one our administrator sent you; you can do this with the command "yppasswd". This will work from any machine. Also, please run "ypchfn", put your email address in the "office" field, and mention someone you are working with.

To make it easier to get around the grid you can run the following commands:

 ssh-keygen  ## just press enter a couple of times at the prompts
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

This will enable you to ssh to any node without entering your password, for instance:

 ssh a05
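
If you run the setup commands more than once, the same key gets appended to authorized_keys each time. The following is a minimal sketch of an idempotent variant; the function name add_key_once and its two path arguments are made up for illustration and are not part of any grid tooling.

```shell
# add_key_once PUB_FILE AUTH_FILE
# Append the key in PUB_FILE to AUTH_FILE unless that exact line is already there.
# (Hypothetical helper; both paths are arguments so it can be tried on scratch files.)
add_key_once() {
    pub_file=$1
    auth_file=$2
    key=$(cat "$pub_file") || return 1
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F fixed string, -x whole-line match, -q no output
    if ! grep -q -x -F -e "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    # With its default StrictModes setting, sshd rejects key files writable by others
    chmod 600 "$auth_file"
}
```

You would call it as: add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys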

Typically it is better to do things like compilation and interactive work such as editing on a randomly chosen node rather than on "login" or "login2" -- this avoids load on those machines. You are encouraged to learn to use the program "screen" so that your work doesn't get interrupted if you lose your ssh connection.


The main rule you should be aware of regarding storage is: don't cause too much I/O to bdc01, since this will make the whole cluster unresponsive. bdc01 is the disk server where most home directories are located (type "df ~" to find out where yours is located). We ask that you limit the size of your home directory to 50G.

When you run experiments, create a directory for yourself on some other disk, such as /export/a01 through /export/a14, except /export/a03 (which doesn't exist) and /export/a06 and /export/a07 (which are reserved). The directory name should be the same as your username; for example, jbloggs would create a directory with

  mkdir /export/a11/jbloggs

You can create such directories on multiple "a" disks if you need. We don't have individual disk-usage limits on these disks, but we do keep track of it and we might start to notice if you use more than a few hundred gigabytes.
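
The disk rules above are easy to get wrong in scripts, so here is a small sketch that encodes them. The function name scratch_dir is hypothetical, and the list of valid disks is taken only from the description above, so adjust it if the set of disks changes.

```shell
# scratch_dir DISK USER
# Print the experiment directory for USER on DISK (e.g. a11), following the rules above:
# a01 through a14 are valid, a03 does not exist, a06 and a07 are reserved.
scratch_dir() {
    disk=$1
    user=$2
    case "$disk" in
        a03|a06|a07)
            echo "scratch_dir: $disk is not available for user directories" >&2
            return 1 ;;
        a0[1-9]|a1[0-4])
            printf '/export/%s/%s\n' "$disk" "$user" ;;
        *)
            echo "scratch_dir: unknown disk '$disk'" >&2
            return 1 ;;
    esac
}
```

For example: mkdir -p "$(scratch_dir a11 "$USER")"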

Most "a" machines have a large disk in /mnt/data which is also exported with names such as /export/a10. With the exception of a06 (which is on a08:/mnt/data2) and a07 (which is on a12:/mnt/data2), you can refer to the directory as either <node>:/mnt/data or /export/<node> in your scripts (you can verify the location of an export directory with, e.g., ypcat -k auto.export | grep a10). All machines have a space in /tmp which you can, in principle, use for local scratch space, but be careful not to fill up /tmp.


The directories /export/aXX and /export/bXX (and most other /export/ directories) are *NOT BACKED UP*. We simply can't afford to back up this much data. They are RAIDed, which reduces the risk of failure, but they sometimes do die. If you have code or scripts which you can't afford to lose, it's your responsibility to back them up. If you put things on bdc01 (i.e. your home directory, if located there), they will be backed up if they are under 50GB, but don't run experiments from your home directory. Personally (Dan Povey) I use git version control for all my important files, which I host on github. This can also work from local repositories, for example ones kept in your home directory. If this is too much for you, you can simply periodically copy your code or scripts to a disk physically hosted on a different machine (check this with the command "df"). We anticipate that you will ignore this advice-- typically people do not start backing up until they have experienced a major data loss. You were warned.
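
For the "periodically copy" option, one possible approach is a dated tar archive written to another disk. This is only a sketch: the function name snapshot_dir and the example paths are hypothetical, and it makes no attempt to prune old archives.

```shell
# snapshot_dir SRC_DIR DEST_DIR
# Write a dated, compressed tar archive of SRC_DIR into DEST_DIR and print its path.
# (Hypothetical helper; DEST_DIR should live on a disk hosted by a different machine.)
snapshot_dir() {
    src=${1%/}
    dest=$2
    name=$(basename "$src")
    archive="$dest/$name-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" || return 1
    # -C changes into the parent so the archive holds relative paths only
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    printf '%s\n' "$archive"
}
```

For example, snapshot_dir /export/a11/jbloggs/scripts /export/a12/jbloggs/backups would leave a copy on a disk served by a different node. Keep only code and scripts in such snapshots, not experiment data.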

Parallel jobs

The way you should run parallel jobs on the nodes is to use the command "qsub". This is part of Sun GridEngine. We don't have time for a tutorial right here, but there seems to be a suitable one at http://www.uibk.ac.at/zid/systeme/hpc-systeme/common/tutorials/sge-howto.html. The only special thing you should know about our grid is that if you want to specify memory resources, you should do so with something like:

 qsub -l mem_free=10G,ram_free=10G my_job.sh

If your jobs use 1G of memory or less you don't need to do this. We do not use special queues, only the queue "all.q" which is the default so you don't need to specify the queue. GridEngine as we have configured it will ignore the hashbang e.g. "#!/bin/csh" at the top of your script. The default is /bin/bash and if you want to change this you need the -S option, e.g. -S /bin/csh. The normal procedure, though, is to simply have a bash script that calls whatever interpreter you really need (e.g. python or matlab).

The job you submit needs to be a shell script, which can't take command-line arguments; if you want to call Python or Matlab or something, do it from the shell script.
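
Since the arguments have to live inside the shell script, one way to avoid writing many near-identical scripts by hand is to generate them. The sketch below assumes nothing about the grid itself; write_wrapper is a made-up helper name.

```shell
# write_wrapper OUT_FILE COMMAND [ARGS...]
# Create an executable bash script OUT_FILE that runs COMMAND with ARGS baked in.
# Arguments are joined with spaces, so this sketch does not preserve arguments
# that themselves contain whitespace or shell metacharacters.
write_wrapper() {
    out=$1
    shift
    {
        echo '#!/bin/bash'
        echo '# Generated wrapper: all arguments are fixed at generation time.'
        printf '%s\n' "$*"
    } > "$out" || return 1
    chmod +x "$out"
}
```

For example, write_wrapper train_job.sh python train.py --lr 0.1 produces a script that can then be handed to qsub.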

Some useful options are shown in the example below:

qsub -l 'arch=*64*' -v PATH -cwd -j y -o /foo/bar/some_log_file.log my_script.sh

If you want to submit a job with multiple threads, you should use the -pe smp N option, e.g.

qsub -l 'arch=*64*' -pe smp 5 foo.sh

for a job that requires 5 threads (or 5 parallel single-threaded processes.. it's the number of processors that is being allocated here). Note that the ram_free option is now configured to be per-job, so if a 5-thread job takes 5G, you would specify -l mem_free=5G,ram_free=5G.
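
To keep the thread count and the per-job memory request together, you could build the command with a small helper such as the one sketched here. The name qsub_smp_cmd is hypothetical, and the function only prints the command so you can inspect it before running it.

```shell
# qsub_smp_cmd THREADS MEM SCRIPT
# Print (not run) a qsub command for a multi-threaded job.
# MEM is the total for the whole job, e.g. 5G, since ram_free is per-job here.
qsub_smp_cmd() {
    threads=$1
    mem=$2
    script=$3
    case "$threads" in
        ''|0|*[!0-9]*)
            echo "qsub_smp_cmd: THREADS must be a positive integer" >&2
            return 1 ;;
    esac
    printf 'qsub -pe smp %s -l mem_free=%s,ram_free=%s %s\n' \
        "$threads" "$mem" "$mem" "$script"
}
```

Calling qsub_smp_cmd 5 5G foo.sh prints: qsub -pe smp 5 -l mem_free=5G,ram_free=5G foo.sh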

Be careful with Matlab, because the latest versions will by default use a large number of threads. To stop this you can call maxNumCompThreads(1) in your Matlab script to make it use only 1 thread. (Or you can specify some larger number, but make sure you submit with the correct -pe smp option.)

You can set up dependencies between your jobs using the -hold_jid option to qsub, e.g.

qsub -hold_jid 19841,19842 foo.sh

won't run foo.sh until jobs 19841 and 19842 are done. You can limit how many tasks of an array job run at one time to, say, 10, by giving the option "-tc 10". This can be useful for jobs that run for a very long time or use a lot of I/O.
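
When chaining jobs in a script you need the job ids that qsub reports. The two helpers below are a sketch of one way to do that; their names are made up, and the parser assumes qsub's confirmation message has the usual form shown in the comment, which you should check against what our qsub actually prints.

```shell
# job_id_from_msg: read qsub's confirmation message on stdin and print the job id.
# Assumes the usual message form: Your job 19841 ("foo.sh") has been submitted
job_id_from_msg() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# join_jids ID [ID...]: print the ids joined by commas, ready for -hold_jid.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_jids() (
    IFS=,
    printf '%s\n' "$*"
)
```

A typical use would be:

 jid1=$(qsub step1.sh | job_id_from_msg)
 jid2=$(qsub step2.sh | job_id_from_msg)
 qsub -hold_jid "$(join_jids "$jid1" "$jid2")" foo.sh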

For jobs that do a lot of I/O, it can be helpful to have them run on the same machine as the data is located, e.g. if your data is on a11 it might be helpful to use the option:

  -l "hostname=a11*"

To tell qsub to only use b01 through b09 or b11 through b13:

  -l "hostname=b0[123456789]*|b1[123]*"

To exclude h01 or y01:

  -l 'hostname=![hy]*'

To use any b machine except b01:

  -l 'hostname=!b01*&b*'
  (Note: single quotes are needed if you use !)
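
To pick the hostname pattern from the location of your data, something like the following sketch may help. The function name host_pattern_for is hypothetical. It uses the a06 and a07 exceptions described in the storage notes and otherwise assumes that /export/<disk> is hosted on the node of the same name, which you can verify with ypcat.

```shell
# host_pattern_for PATH
# Print a hostname= resource pattern for the node that physically hosts PATH,
# where PATH is under /export/<disk>/. Per the storage notes, a06 lives on a08
# and a07 lives on a12; every other disk is assumed to sit on its own node.
host_pattern_for() {
    rest=${1#/export/}
    # If nothing was stripped, PATH was not under /export/
    [ "$rest" != "$1" ] || return 1
    disk=${rest%%/*}
    case "$disk" in
        '')  return 1 ;;
        a06) node=a08 ;;
        a07) node=a12 ;;
        *)   node=$disk ;;
    esac
    printf 'hostname=%s*\n' "$node"
}
```

For example: qsub -l "$(host_pattern_for /export/a11/jbloggs/data)" my_job.sh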

Usually when things are slow it's because some disk that you're using is slow, usually because someone is running jobs that cause excessive I/O to either the disk that you're using, or the home-directory server. Logging into whichever nfs server it is (e.g. a12 for /export/a12) and running "sudo iftop", and correlating the machines that have a lot of network traffic, with the output of "qstat", can sometimes reveal who is the culprit.

If something you need is not installed, you can ask clsphelp@clsp.jhu.edu. For things that are Debian packages and present no obvious security threat, we will typically install it right away (the same day)-- for example, if you need "nano" and the package is not installed.

GPU Jobs


Please never use a GPU (i.e. never run a process that will use a GPU) UNLESS:

  • you have reserved it on the queue, e.g. by doing qlogin -l gpu=1 and running the process in that shell (in which case log out promptly when done); or
  • you are running the process in a script which you submit to the queue with -l gpu=1.

If you use a GPU without reservation, you will kill people's jobs and cause a huge amount of wasted time.

Also, there are a very limited number of GPUs (10) that are available for general use, i.e. that have been bought with general funds. Don't plan to use more than, say, two or three GPUs concurrently.

If qlogin seems to fail, try qlogin -now no


When you log in with -l gpu=1 you are reserving one GPU. Some frameworks, such as Theano and TensorFlow, may automatically use more than one GPU if they detect their presence. To prevent this you should run them as:

CUDA_VISIBLE_DEVICES=`/home/gkumar/scripts/free-gpu` python your_script.py

which sets the CUDA_VISIBLE_DEVICES variable to the integer identifier of a GPU that's free right now.
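
We don't document here what the free-gpu script prints when no GPU is free, so a cautious job script might check the value before starting. The wrapper below is a sketch under that assumption; run_on_gpu is a made-up name.

```shell
# run_on_gpu GPU_ID COMMAND [ARGS...]
# Run COMMAND with CUDA_VISIBLE_DEVICES set to GPU_ID, refusing to start if
# GPU_ID is not a plain non-negative integer (e.g. the lookup printed nothing).
run_on_gpu() {
    gpu_id=$1
    shift
    case "$gpu_id" in
        ''|*[!0-9]*)
            echo "run_on_gpu: no valid GPU id ('$gpu_id'), not starting" >&2
            return 1 ;;
    esac
    CUDA_VISIBLE_DEVICES=$gpu_id "$@"
}
```

Inside a job submitted with -l gpu=1 this could be used as: run_on_gpu "$(/home/gkumar/scripts/free-gpu)" python your_script.py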

Resource Usage

We allow all our collaborators to have an account, but for people outside CLSP, if we feel you are using too many resources we may limit the number of jobs you can run, or we may ask you to limit your disk usage or otherwise do something differently. Examples of bad things you should not do, and which will prompt us to email you, are:

  • Filling up disks
  • Causing excessive disk load by running too many I/O heavy jobs at once
  • Using up too much memory without a corresponding request to qsub (e.g. "-l mem_free=10G,ram_free=10G")
  • Running "too many" jobs. An example of "too many" jobs is
    • Submitting a parallel job with a thousand parts
    • Running 100 jobs that take a week to finish and are not part of some important government contract that is paying for our grid.
  • Running more than a handful of processes outside GridEngine
  • Copying data onto or off of the grid with excessive bandwidth (use rsync --bwlimit=2500).
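
To plan a transfer at the suggested bandwidth limit, a quick estimate helps. This sketch assumes the --bwlimit value is in KB per second, that 1 GB is 1024 * 1024 KB, and that the limit is the bottleneck; transfer_seconds is a hypothetical helper name.

```shell
# transfer_seconds SIZE_GB [LIMIT_KB_PER_S]
# Estimate how long a copy takes at a given rsync --bwlimit value
# (default 2500, the limit suggested above). Integer arithmetic, rounds down.
transfer_seconds() {
    size_gb=$1
    limit=${2:-2500}
    echo $(( size_gb * 1024 * 1024 / limit ))
}
```

For instance, transfer_seconds 100 prints 41943, so copying 100 GB at this limit takes a bit under 12 hours.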

You *can* do the following:

  • ssh directly to a node (e.g. "ssh a10") and run a reasonable number of interactive jobs there, e.g. compilation and debugging.


A few utilities are worth mentioning:

  • clspmenu - provides htop, iftop, iotop, wall, email to clsphelp, qstat grepping, log viewing...
  • q.who - displays users and their number of jobs (q.who vv breaks down grid nodes and statuses)
  • sgestat - displays users and their number of jobs with a total


For printing advice, refer to our "old" grid page at http://wiki.clsp.jhu.edu/view/Grid