Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 11 to 11 | ||||||||
From the snippet of logs, it looks like Maui somehow decided to delete the job. SIGTERM (15) is the first signal that Torque sends to the job's process; if the process fails to exit within a short period, Torque then sends SIGKILL (9), which can't be caught or ignored. We sometimes have users catch TERM in their job script and do some cleanup. I'd look into why Maui decided to delete it, if I were you. That's likely the root of the problem. | ||||||||
Added: | ||||||||
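> > | November 17: Because bes did not have enough computing resources, all of the mbh queue's resources (16*8) were moved into the bes group. If mbh needs them, these computing resources will have to be moved back. | |||||||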
\ No newline at end of file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
It is not clear to me whether there is a correlation between the signal number (SIGTERM, or signal 15) and the program's exit status. If a program that is killed by signal 15 returns an exit code of 15, and the offset is 256, that would explain the exit code of 271 that you see (256+15). From the snippet of logs, it looks like Maui somehow decided to delete the job. SIGTERM (15) is the first signal that Torque sends to the job's process; if the process fails to exit within a short period, Torque then sends SIGKILL (9), which can't be caught or ignored. We sometimes have users catch TERM in their job script and do some cleanup. I'd look into why Maui decided to delete it, if I were you. That's likely the root of the problem. |
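
The 256+15 arithmetic above can be checked with a small sketch. It assumes, as the note itself only guesses, that the reported exit status is a fixed offset of 256 plus the signal number; the helper name `decode_exit_status` and the `OFFSET` constant are made up for illustration and are not part of Torque or Maui.

```python
import signal

# Assumption taken from the note above: a job killed by a signal is
# reported with exit status = OFFSET + signal number.
OFFSET = 256


def decode_exit_status(status, offset=OFFSET):
    """Return the signal name if status looks like offset + signal, else None."""
    signum = status - offset
    try:
        # signal.Signals raises ValueError for numbers that are not signals
        return signal.Signals(signum).name
    except ValueError:
        return None


# 271 - 256 = 15, which is SIGTERM on Linux
print(decode_exit_status(271))
```

Under the same assumption, a job that ignored TERM and was then killed with SIGKILL (9) would show up as 265.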