-- ShiJingyan - 2011-04-15

2011-04-06 (Wed)

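  • The Yangbajing compute nodes have been crashing frequently. A memory limit was added to jobs in the Yangbajing queues: each job may use at most 2 GB of memory. --- Shi Jingyan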

Queues affected:

argofg argorecq argosq argomq

Queue configuration command (qmgr):

set queue argosq resources_default.pmem = 2046mb
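The log records the command for argosq only. A minimal sketch of how the same limit might be applied to all four queues is shown below; the helper name `gen_pmem_cmds` is hypothetical, and piping the directives into `qmgr` assumes the script runs as a TORQUE manager on the server host.

```shell
#!/bin/sh
# Queues and limit as recorded in the log entry above.
QUEUES="argofg argorecq argosq argomq"
PMEM="2046mb"

# gen_pmem_cmds (hypothetical helper): print one qmgr directive per queue.
gen_pmem_cmds() {
    for q in $QUEUES; do
        printf 'set queue %s resources_default.pmem = %s\n' "$q" "$PMEM"
    done
}

# Dry run: only print the directives.
gen_pmem_cmds

# To apply them for real (assumption: sufficient qmgr privileges):
#   gen_pmem_cmds | qmgr
```

Printing first and piping into qmgr only afterwards makes it easy to review the directives before they change the server configuration.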

2011-04-13 (Wed)

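  • Found two pbs_mom processes running at the same time on a compute node, which made job scheduling behave abnormally (torqsrv did not know which one to talk to). After restarting pbs_mom, one old process did not exit, as shown below: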
root 5305 1 0 Mar21 ? 00:00:06 /usr/sbin/pbs_mom -q

root 22427 1 0 17:03 ? 00:00:00 /usr/sbin/pbs_mom -p

Killed process 5305, after which jobs started being scheduled again.
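A check along these lines might catch the duplicate-daemon situation earlier. This is only a sketch: the function name `list_pbs_mom` is hypothetical, it assumes the `ps -ef` column layout shown in the listing above (command in field 8), and the sample input is the two lines copied from this entry.

```shell
#!/bin/sh
# list_pbs_mom (hypothetical helper): read `ps -ef` output on stdin and
# print "PID START" for every process whose command is pbs_mom.
list_pbs_mom() {
    awk '$8 ~ /(^|\/)pbs_mom$/ { print $2, $5 }'
}

# Sample input: the two process lines recorded in the log.
sample='root 5305 1 0 Mar21 ? 00:00:06 /usr/sbin/pbs_mom -q
root 22427 1 0 17:03 ? 00:00:00 /usr/sbin/pbs_mom -p'

# Count the matches; more than one means a stale daemon is still around.
count=$(printf '%s\n' "$sample" | list_pbs_mom | grep -c .)
if [ "$count" -gt 1 ]; then
    echo "WARNING: $count pbs_mom processes found:"
    printf '%s\n' "$sample" | list_pbs_mom
fi

# On a live node the input would come from ps instead:
#   ps -ef | list_pbs_mom
```

The start time column helps tell the stale process (started weeks earlier) from the freshly restarted one before killing anything.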

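  • Job scheduling was abnormal this afternoon: many jobs stayed in state Q (queued) and could not run. No specific cause was found. Judging from the maui log, maui seems unable to place jobs on the compute resources requested in the jobs.
  • To get more information about what maui is doing, the maui log level was set to 1. Changed in /var/spool/maui/maui.conf: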
LOGLEVEL 1
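For illustration, the edit could be scripted roughly as follows. The helper `set_loglevel` is hypothetical, the demonstration runs on a scratch file rather than the live maui.conf, and the `SERVERHOST torqsrv` line is a placeholder for whatever else the real file contains. Whether maui must be restarted to pick up the change is not covered here.

```shell
#!/bin/sh
# set_loglevel FILE LEVEL (hypothetical helper):
# rewrite an existing LOGLEVEL line in FILE, or append one if none exists.
set_loglevel() {
    file=$1
    level=$2
    if grep -q '^LOGLEVEL[[:space:]]' "$file"; then
        sed "s/^LOGLEVEL[[:space:]].*/LOGLEVEL $level/" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf 'LOGLEVEL %s\n' "$level" >> "$file"
    fi
}

# Demonstration on a scratch copy, not on /var/spool/maui/maui.conf.
demo=$(mktemp)
printf 'SERVERHOST torqsrv\nLOGLEVEL 3\n' > "$demo"
set_loglevel "$demo" 1
grep '^LOGLEVEL' "$demo"
```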

2011-04-14 (Thu)

  • On torqsrv, an overly large page cache affects maui, so the cache is now cleared periodically on torqsrv. The crontab entry is:

* */2 * * * /root/kanbw/clearCache.sh

Script contents:

#!/bin/sh

# Drop the page cache when "Cached" in /proc/meminfo exceeds 2048000 kB (about 2 GB).
awk '
$1 == "Cached:" && $2 > 2048000 {
    logfile = "/var/log/cache.txt"
    print "The cache is " $2 >> logfile
    # Record the time of the cleanup.
    "date" | getline now
    close("date")
    print now >> logfile
    # Writing 1 to drop_caches frees the page cache.
    system("echo 1 > /proc/sys/vm/drop_caches")
    print "Cache clearing completed." >> logfile
    print " " >> logfile
}' /proc/meminfo
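The threshold test can be tried out without root and without touching drop_caches. The sketch below is a dry run under stated assumptions: the function `cache_over_limit` is hypothetical, and the meminfo sample is made-up test data, not a measurement from torqsrv.

```shell
#!/bin/sh
# cache_over_limit FILE LIMIT_KB (hypothetical helper):
# succeed if the "Cached:" value in FILE is greater than LIMIT_KB.
cache_over_limit() {
    awk -v limit="$2" '
        $1 == "Cached:" { found = 1; over = ($2 + 0 > limit + 0) }
        END { exit !(found && over) }
    ' "$1"
}

# Made-up sample in /proc/meminfo format, for testing only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
MemTotal:        8174592 kB
Cached:          3145728 kB
SwapCached:            0 kB
EOF

if cache_over_limit "$sample" 2048000; then
    echo "would drop caches"
else
    echo "cache below limit"
fi

# Against the live system:
#   cache_over_limit /proc/meminfo 2048000
```

Matching the first field exactly against "Cached:" avoids picking up the SwapCached line, which the original pipeline excluded with a second grep.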

2011-04-15 (Fri)

To improve hostname resolution performance on torqsrv, all compute nodes it manages have been defined in /etc/hosts.
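One way to confirm that no node was left out is to compare the node list against the hosts file. The following is a sketch only: `missing_hosts` is a hypothetical helper, the node names and 192.0.2.x addresses are placeholders, and the node list is assumed to use the TORQUE nodes-file layout (node name in the first column). It also assumes the hosts file is not empty.

```shell
#!/bin/sh
# missing_hosts NODES_FILE HOSTS_FILE (hypothetical helper):
# print every node name from NODES_FILE that has no entry in HOSTS_FILE.
missing_hosts() {
    awk '
        NR == FNR {                     # first file read: the hosts file
            sub(/#.*/, "")              # strip comments
            for (i = 2; i <= NF; i++)   # every name and alias after the address
                known[$i] = 1
            next
        }
        NF && $1 !~ /^#/ && !($1 in known) { print $1 }   # second file: nodes
    ' "$2" "$1"
}

# Placeholder data for a dry run.
hosts=$(mktemp)
nodes=$(mktemp)
printf '127.0.0.1 localhost\n192.0.2.11 node01.example node01\n' > "$hosts"
printf 'node01 np=8\nnode02 np=8\n' > "$nodes"

missing_hosts "$nodes" "$hosts"

# On the server (paths are assumptions):
#   missing_hosts /var/spool/torque/server_priv/nodes /etc/hosts
```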


Topic revision: r2 - 2011-04-16 - ShiJingyan
 