--
KanBowen - 2011-11-07
[torqueusers] Exit_status=271
I know that the exit status gets offset by some number (128? 256?), but
it's not clear to me whether the signal number (SIGTERM is signal 15)
correlates with the program's exit status. If a program killed by
signal 15 reports an exit code of 15, and the offset is 256, that would
explain the Exit_status of 271 that you see (256 + 15).
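To make the arithmetic concrete, here is a minimal Python sketch that
splits a reported Exit_status into an offset and a signal number. The
helper name describe_exit_status is hypothetical, and the default offset
of 256 is only the guess discussed above, not something confirmed from
the Torque source; pass a different offset to test the other theory.

```python
import signal


def describe_exit_status(exit_status, offset=256):
    """Interpret an Exit_status value, assuming signal deaths are
    reported as offset + signal number (the offset is an assumption)."""
    if exit_status > offset:
        signum = exit_status - offset
        try:
            # Map the number to its symbolic name, e.g. 15 -> SIGTERM.
            name = signal.Signals(signum).name
        except ValueError:
            # Not a valid signal number on this platform.
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    # At or below the offset: treat the value as a plain exit code.
    return f"no signal offset detected, raw code {exit_status}"


print(describe_exit_status(271))
```

On Linux, where SIGTERM is signal 15, a status of 271 with an offset of
256 decodes to SIGTERM, which matches the explanation above.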
From the snippet of logs, it looks like Maui somehow decided to delete
the job. SIGTERM (15) is the first signal that Torque sends to the
job's process; if the process fails to exit within a short period,
Torque then sends SIGKILL (9), which can't be caught or ignored. We
sometimes have users catch TERM in their job script and do some cleanup.
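Our users normally do this in the job script itself, but the same idea
can be sketched in Python for a job whose payload is a Python program.
This is only an illustration: install_term_handler and the cleanup
callback are hypothetical names, and exiting with 128 + signal number is
the usual shell convention, not something Torque requires.

```python
import signal


def install_term_handler(cleanup):
    """Run cleanup() when SIGTERM arrives, then exit.

    The cleanup must finish quickly: SIGKILL follows if the process
    has not exited within the grace period, and it cannot be caught.
    """
    def handle_term(signum, frame):
        cleanup()
        # Exit with 128 + signal number (shell convention, an assumption
        # here) so the caller can tell the job was terminated.
        raise SystemExit(128 + signum)

    # Must be called from the main thread of the program.
    signal.signal(signal.SIGTERM, handle_term)


if __name__ == "__main__":
    install_term_handler(lambda: print("removing scratch files"))
    signal.pause()  # stand-in for the real work; waits for a signal
```

Keep the cleanup short. Anything still running when SIGKILL arrives is
lost.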
I'd look into why Maui decided to delete it, if I were you. That's
likely the root of the problem.