Frequently Asked Questions



Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are currently provided by national systems, for which please visit https://docs.computecanada.ca.


General errors

"Fsync failed"

If you see this error message from the vi editor, you are very likely over your disk quota. You will need to free up space by deleting files on the filesystem where the error occurs.
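
If you are not sure what is consuming the space, a quick survey with du can help. This is only a sketch; the directory to inspect and the file to delete depend on where the quota was exceeded:

$ du -sh ~/* | sort -h | tail    # largest items under your home directory
$ rm some_large_unneeded_file    # a placeholder; delete whatever you can spare, then retry the save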

Quota reached

In some cases, the storage system may mistakenly report that you have reached your storage quota. Please refer to the following page for details.

"Couldn't agree a client-to-server cipher"

In October 2017 we stopped supporting certain old and weak encryption algorithms for SSH connections, so this error usually means you are using an old version of PuTTY or another SSH client. Try updating your SSH client software. A similar error message is "no matching mac found: client hmac-md5...".
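
If you connect with a command-line OpenSSH client rather than PuTTY, reasonably recent versions can list the algorithms your client supports, which you can compare against what the server accepts:

$ ssh -Q cipher    # ciphers your client can offer
$ ssh -Q mac       # MACs your client can offer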

Running jobs

"Error: No suitable queues"

You will get the message "Unable to run job: error: no suitable queues" at submission time if Grid Engine determines the job could never run. You may have failed to provide a run time (h_rt), requested more memory than can be provided (h_vmem), or mistyped the name of a queue or parallel environment. If you want some hints from Grid Engine about why the job can't be scheduled, re-submit it with "-w v":

$ qsub -w v ...other options...

If you know the job can't be scheduled and want to submit it anyway, you can use the "-w w" or "-w n" option to override the default "-w e".
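
As a sketch, a submission script that avoids the most common causes of this error supplies both h_rt and h_vmem and, for parallel jobs, names an existing parallel environment. The PE name and resource values below are only examples; list the PEs actually defined with "qconf -spl":

 #$ -cwd
 #$ -l h_rt=1:0:0,h_vmem=2G
 #$ -pe ompi* 4
 ./my_program    # stands in for your own executable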

No output or truncated output

If your job output is mysteriously truncated, you get no output at all, or you receive a "Killed" message, it might be because:

  • the program exceeded a run-time limit (h_rt). Look at the output you got and try to determine why the program is taking longer than you expected. If you think you know why, resubmit with a larger value for h_rt.
  • the program ran out of memory (h_vmem). See Memory Management for a detailed discussion.
  • the program crashed due to an internal error; check your code for bugs.
  • the disk quota for the filesystem the program is writing to was reached. See disk quotas for a list of quotas and commands to check disk usage.

If your job has been terminated unexpectedly (for example it has exit_status 137 in the 'qacct' records) and it did not violate the run-time limit (h_rt) then it may have violated a memory limit (h_vmem). Try increasing h_vmem or decreasing the size of some large arrays in your program, and resubmit. Use the qacct command to get Grid Engine accounting data about your job: how much memory it consumed, how long it ran, etc.
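
For example, to pull out the accounting fields most relevant to a suspected memory kill (the job ID is illustrative):

$ qacct -j 1234567 | grep -E 'exit_status|maxvmem|ru_wallclock'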

Permission denied (publickey,password,keyboard-interactive)

If you are seeing this error in the output of your job, it is most likely that your passwordless SSH settings are not configured correctly.
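
A common way to set this up between nodes that share your home directory is sketched below; the key type, file name and empty passphrase are one possible choice, so adapt it to your site's instructions:

$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # key with an empty passphrase
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys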

Eqw: Job waiting in error state

If qstat or showq has "Eqw" in the STATE column for one of your jobs you can use

$ qstat -j jobid | grep error 

to check the reason. If you understand the reason and can get it fixed, you can clear the error state with

$ qmod -cj jobid

Some common error messages are these:

can't chdir to directory: No such file or directory
can't stat() "file" as stdout_path

These indicate that some directory or file (respectively) cannot be found. Verify that the file or directory in question exists, i.e., you haven't forgotten to create it and you can see it from the head node. If it appears to be okay, then the job may have suffered a transient condition such as a failed NFS automount, or an NFS server was temporarily down. Clear the error state as shown above.
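
Putting the two commands together, a typical repair session might look like this (the job ID and the missing directory are hypothetical):

$ qstat -j 1234567 | grep error
$ mkdir -p /home/username/results    # recreate whatever was reported missing
$ qmod -cj 1234567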

My job is stuck in the 'dr' state

Sometimes users find their jobs stuck indefinitely in the dr state. The d indicates that qdel has been used to initiate job deletion. A job gets stuck in this state when Grid Engine loses communication with one of the compute nodes and thus cannot cleanly remove the job from a queue, hence the dual state: (r)unning and being (d)eleted. Usually this is due to a failed compute node, which was probably the reason you tried to remove the job in the first place: it was not making any progress. Please let us know about jobs stuck in the dr state so that we can remove faulty nodes from production as soon as possible. You can also try to force-delete your job like so:

 $ qdel -f job_id

If it does not help, let us know and we will delete it for you.

Job won't start

There are many possible reasons why your job might not start right away.

  • There may not be enough CPU slots free.
  • There may not be enough slots in the time-limit queue (medium.q, long.q) your job qualifies for.
  • There might be enough slots but not enough memory because some of the running jobs have reserved a lot of memory.
  • A higher priority job may be reserving slots for a large parallel run.
  • An outage may be scheduled to begin before your job would end. (See Cluster Status for planned outages.)
  • Serial jobs are not scheduled on most 16-core shared memory hosts.
  • An individual research group cannot occupy more than 80% of the slots on a given cluster.
  • Some other requestable resource (e.g. Myrinet endpoints, Fluent licenses) may not be available.

You can query Grid Engine for hints about why your job hasn't run yet:

$ qalter -w v job_id

However, even the output from qalter is sometimes obscure. If you want more assistance interpreting your jobs' situation, contact us.

Rr: Job re-started

A capital R in the job status in qstat or showq signifies "rescheduled" or "rescheduling". If a host goes down while running a job, Grid Engine will put the job back in the waiting list to be run again. "Rr" means it has restarted and is running again; "Rq" means it is waiting to be restarted. We strongly recommend you verify that a restarted job is progressing as you would expect by checking the output file. Not all applications recover gracefully on a restart like this.

If you want your jobs not to be rescheduled when a host fails in the middle, set the "rerun" option to "no" in your job script like this:

#$ -r no

If you like the restart feature but your application doesn't handle it automatically and gracefully, you can also write your script to detect when it has been restarted by checking whether the environment variable $RESTARTED is set to 1.
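
A minimal sketch of such a script, assuming your program can resume from a checkpoint file that it writes itself (my_program, its --resume option and checkpoint.dat are placeholders):

 #$ -r yes
 if [ "$RESTARTED" = "1" ] && [ -f checkpoint.dat ]; then
     ./my_program --resume checkpoint.dat
 else
     ./my_program
 fi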

Why doesn't my job start right away?

This could be for a variety of reasons. When you submit a job to the N1 Grid Engine you are making a request for resources. There may be times when the cluster is busy and you will be required to wait for resources. If you use the qstat command, you may see qw next to your job. This indicates that it is in the queue and waiting to be scheduled. If you see an r next to your job then your job is running.

That said, it is often not clear which missing resources are preventing your job from being scheduled. Most often it is memory, h_vmem, that is in short supply. You may be able to increase your job's likelihood of being scheduled by reducing its memory requirements, since a job that asks for fewer resources is easier to place. For example:

$ qalter -l h_vmem=500M,h_rt=hh:mm:ss job_id

will reduce the virtual memory reserved for the job to 500 megabytes. You must re-supply the h_rt and any other arguments to -l when you use qalter. The default values are listed here. Note that for parallel jobs, this h_vmem request is per process. The scheduler will only start your job if it can find a host (or hosts) with enough memory unassigned to other jobs. You can determine the vmem available on various hosts with

$ qhost -F h_vmem

or you can see how many hosts have at least, say, 8 gigabytes free with

$ qhost -l h_vmem=8G

You can also try defining a short time limit for the job:

$ qalter -l h_rt=0:1:0,...other resources... job_id

imposes a hard run-time limit of 0 hours, 1 minute, 0 seconds (0:1:0). In certain circumstances the scheduler will be able to schedule a job that it knows will finish quickly, where it cannot schedule a longer job.

qalter gotcha at Placentia: Following the simple example above can land your job in the subordinate queue where it is subject to being suspended. Now, if you want your job to start sooner this might be A Good Thing, but if you'd rather wait for a regular queue slot then here's how you avoid that:

An additional -l resource request, suspendable=true, is quietly added to your job by default. qalter replaces the entire list of -l arguments, so if you just want to change h_rt you should first extract the existing resource list and then incorporate the complete, modified list into the qalter command. For example:

$ qstat -j 1098765 | grep resource_list
hard resource_list:         test=false,suspendable=false,h_stack=10M,h_vmem=1G,h_rt=169200
$ qalter -l test=false,suspendable=false,h_stack=10M,h_vmem=800M,h_rt=12:0:0 1098765 

As you can see, suspendable is not the only resource request that gets quietly added to your job by default!

I need to change the order of my waiting jobs

You can shuffle the order of your own jobs with the "job share" option to qalter or qsub. For example,

$ qalter -js 100 job_id

will boost the priority of the given job relative to the other jobs belonging to you. The change in priority may take 15 seconds or so to be reflected in qstat.
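
To watch the relative order of your own waiting jobs after the change, you can list just your jobs:

$ qstat -u $USER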

How to merge output files

Usually there are four output files generated by Grid Engine for a parallel job: .o, .e, .pe and .po. The .*e files represent stderr, and the .*o files represent stdout, while the .p* files get generated when there is a parallel environment specified in a submission script.

You may want to reduce that number of files to just one. Here is how. First, the stderr and stdout streams can be merged with the following option in your submission script:

 #$ -j y

This will yield two files, .o and .po, instead of four. You can merge these two as well if you explicitly specify a name for the output file, like so:

 #$ -o output.log

The problem here is that if you submit several such jobs from the same directory, they will all write to the same file, which is usually undesirable. The advice, then, is to use a Grid Engine environment variable to set a unique file name, like so:

 #$ -o $JOB_ID.log

So, the recipe to get four files into one with a unique name is:

 #$ -j y
 #$ -o $JOB_ID.log

Warning: No xauth data; using fake authentication data for X11 forwarding

Your ~/.Xauthority file got corrupted. Please delete it manually; the next time you log in with X11 forwarding enabled, it will be re-created.

I need to extend the run-time limit for a running job

You cannot alter the parameters of a running job. If you are sure that your job will not be able to finish in time, it may be better to terminate it and re-submit with a larger run-time limit.
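
For example (the job ID, new run-time limit and script name are illustrative only):

$ qdel 1234567
$ qsub -l h_rt=48:0:0 ...other options... myjob.sh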

Requesting resources

I need a lot of CPUs

A large parallel job can wait indefinitely while smaller jobs occupy slots as soon as they become free. You might find it necessary to turn on the job reservation option to qsub, which holds slots for your job as they free up:

$ qsub -R y ...other parameters...

I need to run a quick test

You can run your job in the test queue, but only by asking for it explicitly, like so:

$ qsub -l test=true,h_rt=0:0:10 ...other parameters...

Jobs in the test queue must also have run times of less than one hour (h_rt=01:00:00).

I need to run jobs longer than a month

Check the conditions and caveats associated with Subordinate Queues to see if this solution might work for you. Otherwise, this requires special approval of the Resource Allocation Committee (RAC). Contact support first for advice on making an application to the RAC.