Common Job Failures
The job died with an "exceeded MEM usage hard limit" error
This problem is accompanied by an error message that looks like this:
Job deleted at request of email@example.com job 123456 exceeded MEM usage hard limit (2000 > 1000)
Reason this occurs: Each job is assigned a set amount of RAM before the job starts. In the example above, at least one processor assigned to job 123456 attempted to use 2000MB of RAM when only 1000MB was allocated.
The default memory assignment is 1GB of RAM per processor. For example, a 64 processor job (#PBS -l nodes=64, #PBS -l nodes=8:ppn=8, etc.) is allotted 64GB of RAM by default. The job script can change the default and request more or less memory.
There are two main ways to request memory: per processor and per job. To request a set amount of memory for the entire job, add this line to your job script:
#PBS -l mem=16gb
To request a set amount of memory for each processor assigned to the job, add this line to your job script:
#PBS -l pmem=2gb
The "#PBS -l mem=16gb" directive requests exactly 16GB of RAM for the whole job, whether 1 CPU or 100 CPUs are assigned to it (#PBS -l nodes=1 or #PBS -l nodes=100).
The "#PBS -l pmem=2gb" directive requests exactly 2GB of RAM for each processor assigned to the job. If 16 processors are assigned, the job is allocated 32GB of RAM; if 64 processors are assigned, it is allocated 128GB.
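The arithmetic behind the two request styles can be sketched in a few lines of shell. This is only an illustration of the totals described above; the function names total_gb_pmem and total_gb_mem are hypothetical helpers, not part of PBS.

```shell
#!/bin/bash
# Sketch: total memory (in GB) a job is allotted under each request style.
# Both helpers are hypothetical; they only reproduce the arithmetic in the text.

# pmem=<N>gb is per processor, so the total scales with the processor count.
total_gb_pmem() {
    local procs=$1 pmem_gb=$2
    echo $(( procs * pmem_gb ))
}

# mem=<N>gb covers the whole job, so the processor count does not change it.
total_gb_mem() {
    local procs=$1 mem_gb=$2
    echo "$mem_gb"
}

total_gb_pmem 64 1    # default of 1GB per processor on 64 processors
total_gb_pmem 16 2    # pmem=2gb on 16 processors
total_gb_pmem 64 2    # pmem=2gb on 64 processors
total_gb_mem 100 16   # mem=16gb on 100 processors
```

The first three calls scale with the processor count, while the last one stays at the requested job-wide figure regardless of how many processors are assigned.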
If you encounter this error, the message states how much memory the job tried to use and how much it was allocated. A safe approach is to use the "pmem" option and increase the requested memory by 1GB each time. For example, if you asked for "pmem=2gb" and saw this error, ask for "pmem=3gb" next time. Alternatively, increase the number of processors requested, assuming the job uses less memory per processor when more processors work on the same problem.
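As a rough sketch of the "add 1GB each time" recommendation, the helper below reads the allocated figure out of the error message and prints the next directive to try. It assumes the message has the "(used > allocated)" form shown in the sample above, with both numbers in MB; the function name suggest_pmem is hypothetical.

```shell
#!/bin/bash
# Sketch: suggest the next pmem request from a "MEM usage hard limit" message.
# Assumption: the parentheses hold "used > allocated" in MB, as in the sample message.

suggest_pmem() {
    local msg=$1 limit_mb
    # Pull out the second number inside the parentheses (the allocated amount).
    limit_mb=$(printf '%s\n' "$msg" | sed -n 's/.*(\([0-9][0-9]*\) > \([0-9][0-9]*\)).*/\2/p')
    # Give up if the message does not match the expected form.
    [ -n "$limit_mb" ] || return 1
    # Whole GB currently allocated (rounded down), plus 1GB.
    echo "#PBS -l pmem=$(( limit_mb / 1000 + 1 ))gb"
}

suggest_pmem "job 123456 exceeded MEM usage hard limit (2000 > 1000)"
```

For the sample message, where 1000MB was allocated, this prints the directive for 2GB per processor. If the job fails again with the new limit, rerunning the helper on the new message suggests the next step up.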
MATLAB Job Failures
Job Failing with this kind of error:
Starting matlabpool using the 'local' configuration ... stopped.
Error using matlabpool (line 136)
Failed to open matlabpool. (For information in addition to the causing error, validate the configuration 'local' in the Configurations Manager.)
Error in Job1 (line 12)
matlabpool open local
Caused by:
Error using distcomp.interactiveclient/pGetSockets>iThrowIfBadParallelJobStatus (line 114)
The interactive parallel job finished without any messages.
Solution (WARNING: Try the solutions ONE AT A TIME! Test each solution by itself before moving on to the next. Applying all of them at once can break things even more!):
This type of error is usually caused by a limitation in the environment. Whenever a job is executed, the Operating System (OS) tries to keep users from consuming too many resources. Consumption of an excessive amount of memory, or disk space, or files on a node typically indicates that an application is "out of control". The OS limits the potential damage from an out of control application by limiting what each environment can do (for example, one limitation is the number of files that can be open concurrently).
This occasionally causes problems for some scientific applications.
If you see this problem in MATLAB, here are the steps to try:
1. Are you opening a "local" pool with MATLAB? If so, make sure you include this snippet of code before calling "matlabpool open local". (See the MATLAB page for more information.)
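if ~strcmp(getenv('PBS_JOBID'),'')
    % Inside a PBS job: give the local scheduler its own per-job data directory
    sched = findResource('scheduler','type','local');
    % No trailing semicolon here, so the path is echoed to the job output
    local_scheduler_data=[sched.DataLocation,'/',getenv('PBS_JOBID')]
    mkdir(local_scheduler_data);
    sched.DataLocation=local_scheduler_data;
end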
2. Is it still broken? If so, increase the "open files" limit.
- View your current "open files" limit with "ulimit -a".
- Double that limit by editing your ~/.bashrc file and adding a "ulimit -n" line that sets the new, doubled value
- Log out and log back in
- Check that the change worked by executing "ulimit -a" again
3. Is it still broken? If so, increase the "stack size" limit.
- Follow the same procedure as for the "open files" limit, except change "ulimit -n" to "ulimit -s".
4. Is it still broken? If so, increase the "max user processes" limit.
- Follow the same procedure as for the "open files" limit, except change "ulimit -n" to "ulimit -u".
These four solutions have been known to fix MATLAB problems where the local pool cannot start.
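The "double the limit" procedure in steps 2 to 4 can be sketched as a small shell helper that reads the current limit and prints the line to add to ~/.bashrc. The function name doubled_ulimit_line is hypothetical, and the sketch assumes a bash login shell; it only prints the line and does not edit any file.

```shell
#!/bin/bash
# Sketch: print the ~/.bashrc line that would double one of the limits above.
# $1 is the ulimit flag from the steps: n (open files), s (stack size),
# or u (max user processes).

doubled_ulimit_line() {
    local flag=$1 current
    current=$(ulimit -"$flag")
    case $current in
        # Nothing to double if the limit is already unlimited.
        unlimited) echo "# ulimit -$flag is already unlimited" ;;
        *)         echo "ulimit -$flag $(( current * 2 ))" ;;
    esac
}

doubled_ulimit_line n
```

Copy the printed line into ~/.bashrc, then log out and back in as described in step 2. Note that a regular user cannot raise a limit above its hard limit (shown by "ulimit -H -n" for open files), so if the doubled value is rejected, the hard limit would need to be raised by an administrator.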