-- ShiJingyan - 2011-09-16

Under the current setup, jobs that exceed the queue's configured walltime are not removed by the system; they keep running, and some run for an extremely long time. The queue was already configured with:

set queue besq resources_max.walltime = 60:00:00

However, jobs that have been running for more than 60 hours are not cleaned up by TORQUE.

Solution:

Setting only "set queue besq resources_max.walltime = 60:00:00" is not enough; the queue also needs:

set queue besq resources_default.walltime = 60:00:00

Only when the default value is set does the resources_max.walltime limit become active and take effect.
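For clarity, the two settings issued together through qmgr look like this (a minimal sketch; besq is the queue name used on this site, and the cput lines are the analogous optional settings for CPU time):

qmgr -c "set queue besq resources_max.walltime = 60:00:00"
qmgr -c "set queue besq resources_default.walltime = 60:00:00"
qmgr -c "set queue besq resources_max.cput = 60:00:00"
qmgr -c "set queue besq resources_default.cput = 60:00:00"
qmgr -c "list queue besq"          # verify the attributes were applied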

Reference: http://answerpot.com/showthread.php?412934-torque+does+not+kill+jobs+when+wall_time+or+cpu_time+reached

A follow-up note on the problem:

Testing on TORQUE 3.0.1 shows that resources_default.cput and resources_default.walltime are enforced, whereas resources_max.walltime and resources_max.cput have no effect whether or not they are set.
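To check whether a particular running job actually carries the limit, inspect its Resource_List with qstat (a sketch; 12345 is a placeholder job ID):

qstat -f 12345 | grep Resource_List
# expect a line such as:  Resource_List.walltime = 60:00:00
# if no walltime/cput entry appears, the job will never be killed for exceeding the limit

The relevant discussion from the torqueusers mailing list follows.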

Hi,
>
> I found why jobs are not killed when cput/wall_time is reached.
>
> # qstat -f 10626859|grep Resource_List
> Resource_List.neednodes = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
>
>
> there are no default resource time limits.
>
> Resource_List.cput or Resource_List.walltime
> So I assume that my resources_max values are not taken into consideration as defaults:
>
> resources_max.cput = 01:30:00
> resources_max.walltime = 03:00:00
>
>
>
> and that "breaks" what man says:
>
> resources_max
> The maximum amount of each resource which can be requested by a single job in this queue. The queue value supersedes any server wide maximum limit. Format: "resources_max.resource_name=value", see qmgr(1B); default value: infinite usage.
>
> resources_default
> The list of default resource values which are set as limits for a job residing in this queue and for which the job did not specify a limit. Format: "resources_default.resource_name=value", see qmgr(1B); default value: none; if not set, the default limit for a job is determined by the first of the following attributes which is set: server's resources_default, queue's resources_max, server's resources_max. If none of these are set, the job will have unlimited resource usage.
>

please fill out a TORQUE bug report at www.clusterresources.com/bugzilla

#4

04-06-2010 03:22 PM



On Fri, 4 Jun 2010 10:45:54 +0200
Arnau Bria wrote:


[...]
> Seems a big bug to me, maybe some developer could give his opinion.

Andrey Kiryanov gave the answer:
1) correct a bug in src/include/pbs_config.h.in:
RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
2) enable --enable-maxdefault at configure time

and the docs should be updated.

Cheers,
Arnau
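For reference, on TORQUE versions that ship the option, rebuilding with the flag mentioned above looks roughly like this (a sketch; the source directory and install prefix are placeholders):

cd torque-3.0.1                      # placeholder source tree
./configure --enable-maxdefault --prefix=/usr/local
make
make install                         # restart pbs_server and pbs_mom after installing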

#5

04-06-2010 05:36 PM



On Fri, 04 Jun 2010 13:30:59 +0200
Mgr. Šimon Tóth wrote:

Hi Simon,

> > 1) correct a bug in src/include/pbs_config.h.in:
> > RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
> > 2) enable --enable-maxdefault at configure time
> >
> >
> > and doc should be updated.
>
> That wouldn't make much sense. Max is the maximum for submission, and that's the
> way it should be. The problem is that the server doesn't reject jobs with
> infinite requirements when the max is set.

I don't know if I've understood you, but I think we agree :-)

If a max or default is set at queue level, all jobs from that queue
should take those values by default. Is that what you are saying?

I'd like to hear a developer's opinion on that; I'm sure there must be a
good reason for changing the previous (2.3) behaviour.

Cheers,
Arnau

#6

04-06-2010 09:20 PM



>>> 1) correct a bug in src/include/pbs_config.h.in:
>>> RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
>>> 2) enable --enable-maxdefault at configure time
>>>
>>>
>>> and doc should be updated.
>>
>> That wouldn't make much sense. Max is the maximum for submission, and that's the
>> way it should be. The problem is that the server doesn't reject jobs with
>> infinite requirements when the max is set.
>
> I don't know if I've understood you, but I think we agree :-)
>
> If a max or default is set at queue level, all jobs from that queue
> should take those values by default. Is that what you are saying?
>
> I'd like to hear a developer's opinion on that; I'm sure there must be a
> good reason for changing the previous (2.3) behaviour.

Well, not precisely.

If you don't request any limit, then the assumed semantics is that the
limit is infinite (for walltime, the job will never be killed for
running too long).

Maximum limits constrain submission. They allow you to have a priority queue for short
jobs by setting a short maximum walltime and a high priority on the queue.
No job requesting a longer walltime than the set maximum limit will be
permitted into this queue.

Default values are set for jobs that don't specify any limit themselves.
Very simply, if a job arrives with no limit set, it assumes the default
value (from the server or from the queue, depending on where it is set).

The problem described here appears when you only set a maximum value and
not a default value. Because you don't have a default value (on the
server/queue), the job doesn't receive any additional limitation (this
is OK) and is also permitted to enter the server/queue (this is WRONG),
even though it effectively requests an infinite amount of the resource, and
infinite > any max value set on the server/queue.

--
Mgr. Šimon Tóth
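To make the behaviour described above concrete, here is a small sketch (the queue name, times and job ID are placeholders): with only a maximum set and no default, a job that requests no walltime is still accepted and carries no limit at all.

qmgr -c "set queue short resources_max.walltime = 01:00:00"
echo "sleep 7200" | qsub -q short          # note: no -l walltime requested
qstat -f 12346 | grep walltime             # no Resource_List.walltime line appears,
                                           # so the job is never killed for running too long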

#7

05-06-2010 06:14 AM



On Fri, Jun 4, 2010 at 5:37 PM, David Singleton

> If procs is going to mean processors/cpus then I would suggest there needs
> to be a lot of code added to align nodes and procs - they are specifying
> the same thing.


Moab treats them the same if you do not specify ppn with your nodes
request; however, TORQUE is pretty much unaware of what -l procs=X
means - it just passes the info along to Moab. I would like to see
procs become a real TORQUE resource that means give me X total
processors on anywhere from 1 to X nodes.



#10

09-06-2010 01:29 AM



On Tue, Jun 8, 2010 at 10:36 AM, Ken Nielson
> On 06/07/2010 05:34 PM, David Singleton wrote:
>> On 06/08/2010 01:50 AM, Ken Nielson wrote:
>>> On 06/07/2010 09:29 AM, Glen Beane wrote:
>>>> On Mon, Jun 7, 2010 at 11:21 AM, Ken Nielson
>>>>> On 06/07/2010 09:10 AM, Glen Beane wrote:
>>>>>> On Mon, Jun 7, 2010 at 11:02 AM, Ken Nielson
>>>>>>> On 06/04/2010 08:14 PM, Glen Beane wrote:
>>>>>>>> On Fri, Jun 4, 2010 at 5:37 PM, David Singleton
>>>>>>>>> If procs is going to mean processors/cpus then I would suggest there needs
>>>>>>>>> to be a lot of code added to align nodes and procs - they are specifying
>>>>>>>>> the same thing.
>>>>>>>> Moab treats them the same if you do not specify ppn with your nodes
>>>>>>>> request, however TORQUE is pretty much unaware of what -l procs=X
>>>>>>>> means - it just passes the info along to Moab. I would like to see
>>>>>>>> procs become a real torque resource that means give me X total
>>>>>>>> processors on anywhere from 1 to X nodes.
>>>>>>> Currently Moab interprets procs to mean give me all the processors on X
>>>>>>> nodes.
>>>>>> that doesn't seem correct. I use procs all the time and I do not get
>>>>>> this behavior from Moab (I've tried it with 5.3 and 5.4). The
>>>>>> behavior I expect and see is for Moab to give me X total processors
>>>>>> spread across any number of nodes (the processors could all be on the
>>>>>> same node, or they could be spread across many nodes depending on what
>>>>>> is free at the time the job is scheduled to run).
>>>>> Glen
>>>>>
>>>>> Try doing a qsub -l procs=1. Then do a qstat -f and see what
>>>>> the exec_host is set to.
>>>>>
>>>>> I am running Moab 5.4.
>>>>>
>>>>>
>>>>>
>>>> you must have some TORQUE defaults set, like ncpus that are
>>>> interfering with procs. Since -l procs does not set ncpus, your
>>>> default is getting applied.
>>>>
>>>> gbeane@wulfgar:~> echo "sleep 60" | qsub -l procs=1,walltime=00:01:00
>>>> 69760.wulfgar.jax.org
>>>> qstat -f 69760
>>>> ...
>>>> exec_host = cs-short-2/0
>>>> ...
>>>>
>>>>
>>> Glen,
>>>
>>> You are right. I set those on my last set of problems with syntax.
>>> Ironically they did not affect those resources.
>>>
>>> Ken
>>>
>>
>> I rest my case.
>>
>>
>> We treat ncpus as moab appears to treat procs. But the server also
>> aligns ncpus and nodes requests, eg.
>>
>> vayu2:~> qsub -lncpus=4 -h w
>> 194363.vu-pbs
>> vayu2:~> qstat -f 194363
>> Job Id: 194363.vu-pbs
>> ...
>> Resource_List.ncpus = 4
>> Resource_List.neednodes = 4:ppn=1
>> Resource_List.nodect = 4
>> Resource_List.nodes = 4:ppn=1
>> ...
>>
>> vayu2:~> qsub -lnodes=1:ppn=4 -h w
>> 194365.vu-pbs
>> vayu2:~> qstat -f 194365
>> Job Id: 194365.vu-pbs
>> ...
>> Resource_List.ncpus = 4
>> Resource_List.neednodes = 1:ppn=4
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1:ppn=4
>> ...
>>
>> Any resource limits or defaults really apply to both ncpus (procs) and
>> nodes.
>>
>> David
>>
>>

#11

09-06-2010 03:50 AM



On Jun 8, 2010, at 6:37 PM, David Singleton

> On 06/09/2010 07:29 AM, Glen Beane wrote:
>>
>> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
>> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
>> per node then that is what I should get. I don't want 10 nodes with 4
>> processors or 2 nodes with 16 processors and 1 with 8, etc. If people
>> don't care about the layout of their job they can use -l procs.
>> hopefully with select things will be less ambiguous and will allow for
>> greater flexibility (let the user be precise as they want, but also
>> allow some way to say I don't care, just give me X processors).
>
> Our experience is that very few users want detailed control over exactly
> how many physical nodes they get - it seems to be only comp sci students
> or similar with mistaken ideas about the value of such control. They
> don't seem to realise that when they demand 1 cpu from each of 16 nodes,
> variability in what is running on the other cpus on those nodes will
> make a mockery of any performance numbers they deduce. Other reasons for
> requesting exact nodes are usually to do with another resource (memory,
> network interfaces, GPUs, ...). It should be requests for those
> resources/node properties that get what the user wants, not the number of nodes.
>
> We certainly have more users with hybrid MPI-OpenMP codes and for them,
> nodes are really "virtual nodes", eg. a request for -lnodes=8:ppn=4 means
> the job will be running with 8 MPI tasks each of which will have 4 threads -
> the job needs any (the best?) set of cpus that can run that. A 32P SMP
> might be a perfectly acceptable solution.
>
> I suspect hybrid codes will become more common.
>
> So I would suggest EXACTNODE should not be the default but rather that
> users thinking they want such detailed control should have to specify some
> other option to show this (eg. -lother=exactnodes), ie. nodes are
> "virtual nodes" unless the user specifies otherwise.
>
>>
>> Also, the documentation should be clear that when you request a number
>> of processors per node (ppn) or a number of processors (procs) it is
>> talking about virtual processors as configured in pbs_server
>
> True.
>
> Note that virtual processors = physical processors causes a number of
> problems.

Generally the number of virtual processors is set to the number of CPUs * cores per
CPU. Some (but not all) users know each processor has multiple cores
and think that by requesting the processor they are allocated all
cores of the processor.
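As an illustration of the "virtual nodes" usage discussed above, a hybrid MPI-OpenMP submission might look like the following sketch (the script and application names, walltime and mpirun invocation are assumptions that depend on the local MPI stack and scheduler configuration):

#!/bin/bash
#PBS -l nodes=8:ppn=4
#PBS -l walltime=02:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
# 8 MPI tasks, each expected to run 4 OpenMP threads within its ppn=4 allocation
mpirun -np 8 ./hybrid_app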


#12

09-06-2010 03:54 AM



On Jun 8, 2010, at 6:37 PM, David Singleton

> On 06/09/2010 07:29 AM, Glen Beane wrote:
>>
>> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
>> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
>> per node then that is what I should get. I don't want 10 nodes with 4
>> processors or 2 nodes with 16 processors and 1 with 8, etc. If people
>> don't care about the layout of their job they can use -l procs.
>> hopefully with select things will be less ambiguous and will allow for
>> greater flexibility (let the user be precise as they want, but also
>> allow some way to say I don't care, just give me X processors).
>
> Our experience is that very few users want detailed control over exactly
> how many physical nodes they get - it seems to be only comp sci students
> or similar with mistaken ideas about the value of such control. They
> don't seem to realise that when they demand 1 cpu from each of 16 nodes,
> variability in what is running on the other cpus on those nodes will
> make a mockery of any performance numbers they deduce. Other reasons for
> requesting exact nodes are usually to do with another resource (memory,
> network interfaces, GPUs, ...). It should be requests for those
> resources/node properties that get what the user wants, not the number of nodes.
>
> We certainly have more users with hybrid MPI-OpenMP codes and for them,
> nodes are really "virtual nodes", eg. a request for -lnodes=8:ppn=4 means
> the job will be running with 8 MPI tasks each of which will have 4 threads -
> the job needs any (the best?) set of cpus that can run that. A 32P SMP
> might be a perfectly acceptable solution.
>

Select takes care of this. You request 8 tasks with 4 virtual procs per
task. The scheduler can co-locate tasks. However, if I go through the
trouble of requesting a specific number of nodes then I should get them.


Replying from my phone so ignore the rest of this email. It is a pain
to delete what I'm not commenting on.


> I suspect hybrid codes will become more common.
>
> So I would suggest EXACTNODE should not be the default but rather that
> users thinking they want such detailed control should have to specify some
> other option to show this (eg. -lother=exactnodes), ie. nodes are
> "virtual nodes" unless the user specifies otherwise.
>
>>
>> Also, the documentation should be clear that when you request a number
>> of processors per node (ppn) or a number of processors (procs) it is
>> talking about virtual processors as configured in pbs_server
>
> True.
>
> Note that virtual processors = physical processors causes a number of
> problems. Certainly cpuset-aware MOMs are going to barf with such a setup,
> and the problem is that they don't know this is the config, only the server
> and scheduler do. It sorta makes sense for the number of virtual processors
> to be set in the MOM's config file so it can shut down NUMA/cpuset/binding
> code when it doesn't make sense.
>
> Cheers,
> David
>