
June

6.1~6.8 (transfers normal -> high application error rate in analysis jobs -> CREAM CE unstable for two days)

  • 6.1 A large number of wangxy's analysis jobs failed with configuration errors -> caused by the user's own configuration.
  • 6.3-6.4 blparser unstable; JobRobot only 16% on the 4th -> caused by the top BDII.
  • 6.7 347 analysis jobs aborted -> 60302, no output log files found -> cmsRun did not finish normally; the output file in the user's CMSSW configuration did not match the one in the CRAB configuration.
  • Weekly job summary:
(1) Test jobs: JobRobot succeeded 70% on June 3 and 16% on June 4, caused by blparser errors.
(2) Total jobs: 16,879, success rate 67%. Mostly analysis jobs, with few production jobs.
(3) Production jobs: 100% success.
(4) Analysis job error rate above 50%.
(5) Main error causes: grid errors 12%, application errors 25%. Application errors were mainly: configuration errors (1,800), killed (900), failed to copy output file (1,000), output file not found (800).
No file open errors this week.
(6) The 12% caused by the site are the grid (CREAM CE) errors; all the others are user application errors.
  • Weekly transfer summary:
Links to all sites normal; 7 TB imported, 5.6 TB exported.

6.9~6.15 (transfers normal -> high volume -> analysis job application error rate 16%)

  • 6.10 144 jobs failed with "output files not found"
  • 6.13 806 jobs failed with 8001 "cmssw exception caught"
  • 6.14 112 jobs failed with 50115 "cmssw hasn't produced a valid job report"
  • 6.15 211 jobs failed with 8001 "cmssw exception caught"
  • Weekly job summary:
(1) Test jobs: essentially 100%; only Nagios dropped to 92% on the 11th, when the SE was critical for one hour.
(2) Total jobs: 19,985, success rate 83%.
(3) production: 52/2195 (2%)    Analysis: 3217/10618 (30%)
(4) Main error causes: grid errors 1%, application errors 16%. Application errors were mainly: output files not found (1082), cmssw exception caught (1067), can't copy output files to the SE (358).
The "can't copy output files to the SE" errors were concentrated on the 11th and related to the short SE critical.
  • Weekly transfer summary:
Links to all sites normal; 25 TB imported, 8 TB exported.

6.16~6.22 (transfers normal -> low volume -> high proportion of production jobs -> high success rate)

  • 6.17 facility meeting: (1) Nagios will replace the SAM tests (2) HammerCloud will replace JobRobot
  • 6.20 Nagios CE critical for 2 hours
  • 6.21 taojq's jobs were submitted to external sites and their outputs could not be copied back to our SE; a small fraction copied back correctly, and resubmission still failed.
60317 - Forced timeout for stuck stage out
60307 - can't copy output files to the SE
These errors mean that the output files could not be copied to the remote SE within the allocated time. They can be caused either by the output files being very large or by a network problem. Here it was a network problem, and the solution is to resubmit the jobs, possibly after blacklisting the sites where they failed.
Solutions:
1. Verified that the Beijing SE is OK: the transfer between T2_US_Purdue and T2_CN_Beijing works.
2. Checked that users from other sites also see the same errors at T2_US_Purdue.
3. Checked the CRAB hypernews: 60307 and 60317 are very common errors there. The only suggestion is to further split the number of events per job (see the configuration sketch below).
4. Checked "SRM watch" in the SE monitoring to see the real file sizes.
Main reasons: (1) Most of the user's files are too big, larger than 13 GB, and transferring such big files is risky.
(2) Users do not know the real size of their data files, since the data sit at other sites and cannot be tested locally.
(3) Users are not willing to use /store/temp as a temporary workaround.
Notes: found that the CMS CRAB monitoring and the user logs are inconsistent in some ways, e.g. the failed job numbers.
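To make point 3 concrete, the change is in the user's CRAB job splitting, combined with blacklisting the sites where the stage-out kept failing before resubmitting. This is only a minimal sketch assuming a CRAB 2 style crab.cfg; the dataset, pset and site names are hypothetical placeholders, and the parameter names should be checked against the CRAB version actually in use:

[CMSSW]
# hypothetical dataset and CMSSW config, for illustration only
datasetpath            = /SomeDataset/SomeProcessing/AOD
pset                   = analysis_cfg.py
total_number_of_events = -1
# smaller events_per_job => more but smaller jobs => smaller output files to stage out
events_per_job         = 5000

[GRID]
# placeholder site name: skip the sites where stage-out failed, then resubmit
se_black_list          = T2_XX_Example

Lowering events_per_job keeps each job's output well below the ~13 GB files seen here, so the copy back to the SE fits within the allocated stage-out time.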
  • Weekly job summary:
(1) Test jobs: essentially 100%.
(2) Total jobs: 14,606, success rate 90%.
(3) production: 27/8458   Analysis: 700/1847
(4) Main error causes: grid errors 5%, application errors 3%. Application errors were mainly: CMS exception (241), killed (139), cmssw can't produce report (108).
  • Weekly transfer summary:
Links to all sites normal; 7 TB imported, 5.76 TB exported.

6.23~6.29 (transfers normal -> high proportion of production jobs -> high success rate)

  • 6.22 Nagios SE critical for 5 hours, possibly caused by taojq's jobs staging output files to the SE
  • 6.23 lcg CE critical for 2 hours; /dev/null had become read-only
  • 6.26 vobox and lfc critical for 1 hour; brief loss of network connectivity
  • Weekly job summary:
(1) Test jobs: SAM tests 84% on the 22nd, SE errors; JobRobot jobs 11% on the 23rd, showing PBS errors related to the lcg CE problem.
(2) Total jobs: 13,000, success rate 90%.
(3) production: 95/6276   Analysis: 270/1313
(4) Main error causes: grid errors 9%, application errors 1%. Grid errors were mainly cancelled (82%) and aborted (18%), related to the test failures. Application errors were mainly CMSSW errors.
  • Weekly transfer summary:
Links to all sites normal; 7.3 TB imported, 7.1 TB exported.

July

6.30~7.13 (transfers normal -> success rate normal at 90% -> blparser down)

  • 7.1 PBS scheduling was abnormal: CMS occupied only 0.22% of the slots while many jobs were still queued, some for quite a long time. Main cause: some earlier broken jobs were blocking the newly submitted queue.
  • 7.2 Many analysis jobs, low error rate, mostly CMSSW errors.
  • 7.3 cce service critical for 14 hours; blparser died.
  • Weekly job summary:
(1) Test jobs: JobRobot success rate 83% on the 2nd and 59% on the 3rd, because the blparser service went down again; it needs further improvement.
(2) Total jobs: 25,683, success rate 90% (grid 5% / application 5%).
(3) production: 13/3193   Analysis: 1700/9269
(4) Main error causes: grid errors 5%, concentrated on the 2nd and 3rd and related to the blparser failure. Application errors 5%, mainly CMSSW errors (~70%) and failed to copy files to the SE (~20%).
  • Weekly transfer summary:
Links to all sites normal; 10.5 TB imported, 7 TB exported.

7.14~7.27 (shutdown for the summer heat break -> dCache upgraded from pnfs to Chimera -> services back to normal)

  • 7.15~7.25 Declared a downtime and upgraded dCache from pnfs to Chimera. The upgrade went well and transfers have since returned to normal, but some data were lost and need to be retransferred.
  • 7.18 Frontier/squid was stopped and restarted; back to normal.
  • 7.19 The job system is back to normal. The JobRobot data on the SE were affected by the upgrade and need to be retransferred, which will affect the JobRobot tests for a few days; the symptom is that there are essentially no JobRobot jobs.

August

7.28~8.3 (large volume of data requests and transfers -> data lost because of the upgrade -> high job volume)

  • Weekly job summary: 7.26~7.27 production jobs failed with the error "Scram Project Command Failed in ProdAgent production job".
(1) Test jobs: essentially normal, 100%.
(2) Total jobs: 42,842, success rate 92%, application errors 7%.
(3) production: 2360/34000   Analysis: 781/3167
(4) Main error causes: application errors were mainly "Scram Project Command Failed in ProdAgent production job" (2,300); the problem appeared on the 27th and then recovered by itself, and no cause was found. Other sites have seen similar problems after a downtime, so a problem in the CMS ProdAgent is suspected. Also CMS exception (348) and FileOpenError (170).
(5) Note: local users report that when the job volume is high their jobs queue for a long time; should we consider raising their priority?
  • Weekly transfer summary:
Problem: some data were lost after the dCache upgrade and new data requests came in, so both the request and transfer volumes were very large this week. In the middle of the week the FileDownload agent of the PhEDEx production instance went into a cyclic error, going down every eight hours.
Cause: when the volume of requested data exceeds 30 TB, the communication between the production-instance agent and the central FileRouter at CERN takes too long and the agent gets blocked. The PhEDEx experts consider this a PhEDEx bug and are working on a fix.
Status: now back to normal; 40 TB imported, 5 TB exported (exports are low because the long downtime left no accumulated MC data).
The debug-instance transfers were not affected; transfer links to all sites are normal.

8.4~8.11 (large volume of data requests and transfers -> a large batch of user jobs submitted by mistake -> system overload)

  • 8.4 Datasets got stuck for more than 5 days when they were close to complete.
(1) Reason: a bug in PhEDEx. It chose wrong routes through a site that does not hold a complete copy of the data, and PhEDEx keeps retrying the failing transfers until they expire. (2) Solution: suspend the subscription, wait more than 30 minutes, and unsuspend it.
  • 8.5 A CRAB developer wrongly submitted more than 60,000 jobs to T2_CN_Beijing, which overloaded the site. The same has happened at other CMS sites and was discussed in the hypernews; the suggested method is to limit the maximum number of jobs a single user may have, enforced locally at the site (see the sketch below). We contacted the user directly and found that he is a CRAB developer: a bug in the JobSubmitter component caused an infinite loop and hence such a load on our site.
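One way to enforce the per-user job cap suggested above is in the local batch scheduler. The sketch below assumes the site runs Torque/PBS with the Maui scheduler (an assumption; the limit values are only illustrative and this is not the site's actual configuration):

# maui.cfg: default per-user limits
# MAXJOB  = maximum simultaneously running jobs per user
# MAXIJOB = maximum idle (queued) jobs per user that the scheduler will consider
USERCFG[DEFAULT]  MAXJOB=2000 MAXIJOB=5000

A cap like this limits how much of the farm a single runaway submission can occupy while still letting normal users run.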
  • Weekly job summary:
(1) Test jobs: JobRobot tests 80% on the 5th, with errors connecting to the CCE.
(2) Total jobs: 16,645, success rate 88%, application 5%, grid 5%.
(3) production: 100/4062   Analysis: 250/5243
(4) Main error causes: (1) the 60,000 wrongly submitted jobs overloaded the system; (2) some simulation jobs went into infinite loops and used up the local disk space.
  • Weekly transfer summary:
Problem: some requested data got stuck when the transfers were close to 100%.
Cause: a bug in PhEDEx routing: it routed the transfers through a site with an incomplete copy of the data. The only remedy is to make it redefine the routes.
Status: 30 TB imported, 6.6 TB exported.

8.12~8.18 (data transfers normal -> test jobs normal -> many application errors)

  • 8.11~8.12 Some CMS production jobs ran in an infinite loop and stopped only when the local disk space (more than 300 GB) was used up. I reported this to CMS data operations with the log files, and they have fixed the problem.
  • Weekly job summary:
(1) Test jobs: 100% success.
(2) Total jobs: 18,340, success rate 64%, application 13%, grid 23%.
(3) production: 59/100   Analysis: 5600/9450
(4) Main error causes: recently some simulation jobs have gone into infinite loops, their logs gradually filling the local disk and affecting other jobs.
  • Weekly transfer summary:
Transfers normal.
Status: 9.3 TB imported, 11 TB exported.

8.19~8.25 (data transfers normal -> test jobs normal)

  • Weekly job summary:
(1) Test jobs: 100% success.
(2) Total jobs: 10,773, success rate 80%, application 14%, grid 6%.
(3) production: 472/578   Analysis: 1400/3000
(4) Main error causes: grid errors were mainly due to individual worker nodes being down on the 17th and 19th; application errors were FileOpenError, failed to copy files to the SE, and output files not found.
  • Weekly transfer summary:
Transfers normal.
Status: 5.4 TB imported, 5.02 TB exported.

8.26~9.1 (data transfers normal -> test jobs normal)

  • 8.25-8.30 The SE and cce had brief criticals, slightly affecting the SAM tests: the success rate was 96% on those two days.
  • Weekly job summary:
(1) Test jobs: 100% success.
(2) Total jobs: 17,104, success rate 84%, application 11%, grid 5%.
(3) production: 245/5000   Analysis: 2100/4750
(4) Main error causes: grid errors concentrated on the 25th, 27th, and 29th, including local batch problems and terminated jobs; application errors were FileOpenError, caused by user configuration problems.
  • Weekly transfer summary:
Transfers normal.
Status: 5.8 TB imported, 5 TB exported.

