June

6.1~6.8 (transfers normal -> high application error rate in analysis jobs -> creamce unstable for two days)

  • 6.1 A large number of wangxy's analysis jobs failed with configuration error -> the user's own configuration problem.
  • 6.3-6.4 blparser unstable; jobRobot at 16% on the 4th -> caused by the top BDII.
  • 6.7 347 analysis jobs aborted -> 60302, no output log files found -> cmsRun did not finish normally; the output files named in the user's CMS configuration and CRAB configuration do not match.
  • Weekly job summary:
(1) Test jobs: JobRobot success was 70% on June 3 and 16% on June 4, caused by blparser errors.
(2) Total jobs: 16879, success rate 67%. Mostly analysis jobs, with few production jobs.
(3) Production jobs: 100% successful.
(4) Analysis jobs: error rate above 50%.
(5) Main error causes: grid errors 12%; application errors 25%. The main application errors: configuration errors 1800; kill errors 900; fail to copy to output file 1000; output file not found 800.
No file open errors this week.
(6) 12% caused by the site (grid-creamce); all the rest are user application errors.
  • Weekly transfer summary:
Links with all sites normal; 7TB in, 5.6TB out.

6.9~6.15 (transfers normal -> high volume -> analysis job application error rate 16%)

  • 6.10 144 job failures caused by "output files not found"
  • 6.13 806 job failures caused by 8001 "cmssw exception caught"
  • 6.14 112 job failures caused by 50115 "cmssw hasn't produce valid job report"
  • 6.15 211 job failures caused by 8001 "cmssw exception caught"
  • Weekly job summary:
(1) Test jobs: basically 100%; only nagios dropped to 92% on the 11th, when the SE was critical for 1 hour.
(2) Total jobs: 19,985, success rate 83%.
(3) production: 52/2195 (2%)    Analysis: 3217/10618 (30%)
(4) Main error causes: grid errors 1%, application errors 16%. The main application errors: output files not found (1082), cmssw exception caught (1067), can't copy output files to the SE (358).
The "can't copy output files to the SE" errors were concentrated on the 11th and are related to the brief SE critical.
  • Weekly transfer summary:
Links with all sites normal; 25TB in, 8TB out.
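
The percentages in item (3) of the 6.9~6.15 job summary can be reproduced with a few lines of Python. This is only a sketch, and it assumes each pair is to be read as failed/total; the function name `failure_rate` is made up for illustration.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return failed/total as a percentage; 0.0 when no jobs ran."""
    if total == 0:
        return 0.0
    return 100.0 * failed / total


# Counts copied from item (3), assumed to mean failed/total.
weekly_counts = {
    "production": (52, 2195),
    "analysis": (3217, 10618),
}

for kind, (failed, total) in weekly_counts.items():
    rate = failure_rate(failed, total)
    print(f"{kind}: {failed}/{total} = {rate:.1f}% failed")
```

Rounded to whole percents, this matches the 2% and 30% quoted in the summary.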

6.16~6.22 (transfers normal -> low volume -> large share of production jobs -> high success rate)

  • 6.17 facility meeting: (1) Nagios will replace the SAM tests (2) HammerCloud will replace jobRobot
  • 6.20 nagios CE critical for 2 hours
  • 6.21 taojq's jobs were submitted to external sites, but the results could not be copied back to the SE. A few were copied back correctly; resubmission still failed.
60317 - Forced timeout for stuck stage out
60307 - can't copy output files to the SE
This means that the output files could not be copied to the remote SE in the allocated time.
It can be caused either by very large output files or by a network problem.
It was the network problem, and the solution is to resubmit the jobs,
perhaps after blacklisting the sites where they failed.
Solutions:
1. Prove that the Beijing SE is OK: the transfer between T2_US_Purdue and T2_CN_Beijing works.
2. Check that users from other sites (T2_US_Purdue) also have the same errors.
3. Check the CRAB hypernews: 60307 and 60317 are very common errors. The only suggestion is to
split the event number further.
4. Check "SRM watch" in the SE monitoring to see the real file size.
Main reasons:
(1) Most of the user files are too big, larger than 13GB; transferring files that large is risky.
(2) Users do not know the real size of the data files, since the data is at other sites and cannot be tested.
(3) Users are not willing to use /store/temp as a temporary solution.
Notes: CMS CRAB monitoring and the user logs are inconsistent in some respects, e.g. the number of failed jobs.
  • Weekly job summary:
(1) Test jobs: basically 100%
(2) Total jobs: 14,606, success rate 90%.
(3) production: 27/8458   Analysis: 700/1847
(4) Main error causes: grid errors 5%, application errors 3%. The main application errors: CMS exception (241), kill (139), cmssw can't produce report (108).
  • Weekly transfer summary:
Links with all sites normal; 7TB in, 5.76TB out.
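
The advice in the 6.21 entry to split the event number further, so that output files stay well below the risky 13GB range, comes down to simple arithmetic. The sketch below is illustrative only: `plan_split`, the per-event size and the 2 GB target are hypothetical values, not CRAB parameters or site measurements.

```python
import math


def plan_split(total_events: int, bytes_per_event: int, max_output_bytes: int):
    """Return (events_per_job, n_jobs) so each job's output stays under the cap."""
    if total_events <= 0 or bytes_per_event <= 0 or max_output_bytes <= 0:
        raise ValueError("all inputs must be positive")
    # Largest whole number of events whose output still fits under the cap.
    events_per_job = max(1, max_output_bytes // bytes_per_event)
    n_jobs = math.ceil(total_events / events_per_job)
    return events_per_job, n_jobs


GB = 1024 ** 3
# Hypothetical task: 1,000,000 events at about 200 KiB of output per event,
# with a 2 GB target per output file.
events_per_job, n_jobs = plan_split(1_000_000, 200 * 1024, 2 * GB)
print(events_per_job, n_jobs)
```

The per-event output size has to be estimated from a small test job, which is exactly what item (2) of the main reasons says users could not do for data held at other sites.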

6.23~6.29 (transfers normal -> large share of production jobs -> high success rate)

  • 6.22 Nagios SE critical for 5 hours; caused by taojq's jobs writing output files to the SE?
  • 6.23 lcg CE critical for 2 hours; the /dev/null file had become read-only
  • 6.26 vobox, lfc critical for 1 hour; the network was briefly unreachable
  • Weekly job summary:
(1) Test jobs: SAM tests 84% on the 22nd, SE error; jobrobot jobs 11% on the 23rd, showing pbs errors related to the lcg CE error.
(2) Total jobs: 13,000, success rate 90%.
(3) production: 95/6276   Analysis: 270/1313
(4) Main error causes: grid errors 9%, application errors 1%. The grid errors were mainly cancelled (82%) and aborted (18%), related to the causes of the test failures. The application errors were mainly CMSSW errors.
  • Weekly transfer summary:
Links with all sites normal; 7.3TB in, 7.1TB out.

July

6.30~7.13 (transfers normal -> success rate normal at 90% -> blparser down)

  • 7.1 PBS scheduling abnormal: CMS had only a 0.22% share while many jobs were still queued, some of them for a long time. Main cause: some old broken jobs were blocking the queue of newly submitted jobs.
  • 7.2 Many analysis jobs, low error rate, mostly CMSSW errors
  • 7.3 cce service critical for 14 hours; blparser had died
  • Weekly job summary:
(1) Test jobs: jobrobot success rate was 83% on the 2nd and 59% on the 3rd because the blparser service went down again; this needs further improvement.
(2) Total jobs: 25,683, success rate 90%, grid 5% / application 5%.
(3) production: 13/3193   Analysis: 1700/9269
(4) Main error causes: grid errors 5%, concentrated on the 2nd and 3rd and related to the blparser service failure. Application errors 5%, mainly CMSSW errors (~70%) and failed to copy files to the SE (~20%).
  • Weekly transfer summary:
Links with all sites normal; 10.5TB in, 7TB out.

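7.14~7.27 (shutdown for the high-temperature holiday -> dCache upgrade from pnfs to Chimera -> services back to normal)

  • 7.15~7.25 Downtime requested to upgrade dCache from pnfs to Chimera. The upgrade went fairly well and transfers are back to normal, but some data was lost and has to be transferred again.
  • 7.18 Frontier/squid shut down and restarted; back to normal
  • 7.19 Job system back to normal. The jobrobot data was affected by the SE upgrade and has to be transferred again, which will affect the jobRobot tests for a few days; the symptom is that there are almost no jobrobot jobs.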

August

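7.28~8.3 (large data request and transfer volume -> data lost in the upgrade -> high job volume)

  • 7.26~7.27 Production jobs failed with the error: Scram Project Command Failed in ProdAgent production job
  • Weekly job summary:
(1) Test jobs: basically normal, 100%
(2) Total jobs: 42,842, success rate 92%, application 7%.
(3) production: 2360/34000   Analysis: 781/3167
(4) Main error causes: the application errors were mainly Scram Project Command Failed in ProdAgent production job (2,300), CMS exception (348) and FileOpenError (170). The Scram problem appeared on the 27th and then cleared by itself; no cause was found. Other sites have seen similar problems after a downtime, so we suspect a problem in the CMS ProdAgent.
(5) Note: local users report that their jobs queue for a long time when the job volume is high. Could we consider raising their priority?
  • Weekly transfer summary:
Problem: some data was lost after the dCache upgrade. Together with new data requests, this made both the request volume and the transfer volume very large this week. During this period the FileDownload agent of the PhEDEx production instance hit a recurring error and went down once every eight hours.
Cause: when the requested data volume exceeds 30TB, communication between the production-instance agent and the central FileRouter at CERN takes too long and the agent blocks. The PhEDEx experts consider it a PhEDEx bug and are working on it.
Status: back to normal now; 40TB in, 5TB out (little outgoing traffic: the downtime was long and no MC data had accumulated).
Transfers in the debug instance were not affected; transfer links with all sites are normal.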

8.4~8.11 (large data request and transfer volume -> large number of user jobs submitted by mistake -> system congested)

  • 8.4 Datasets get stuck for more than 5 days when they are close to complete.
(1) Reason: a bug in PhEDEx. It chose wrong routes, through a site that does not hold the complete data, and keeps retrying the unsuccessful transfer until it expires.
(2) Solution: suspend the data, wait more than 30 minutes, and unsuspend it.
  • 8.5 A CRAB developer wrongly submitted more than 60,000 jobs to T2_CN_Beijing, which overloaded the site. The same happened at other CMS sites and was discussed in hypernews. The suggested remedy is to limit the maximum number of jobs per user locally at the site. We contacted the user directly and found that he is a CRAB developer; a bug in the JobSubmitter component had caused an infinite loop and hence the load on the site.
  • Weekly job summary:
(1) Test jobs: JobRobot tests 80% on the 5th, with "cannot connect to CCE" errors.
(2) Total jobs: 16,645, success rate 88%, application 5%, grid 5%.
(3) production: 100/4062   Analysis: 250/5243
(4) Main error causes: (1) the 60,000 wrongly submitted jobs overloaded the system (2) some simulation jobs went into an infinite loop and used up the local disk space.
  • Weekly transfer summary:
Problem: some requested data stalled when the transfer was close to 100%.
Cause: a bug in PhEDEx routing. It set the transfer routes to a site where the data is incomplete. The only fix is to make it redefine the routes.
Status: 30TB in, 6.6TB out.
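
The remedy suggested in the 8.5 entry, limiting the maximum number of jobs of a single user at the site, can be sketched as a simple admission policy. This is an illustration of the idea only; in practice the cap would be set in the batch system configuration, and `admit_jobs` and its arguments are hypothetical names.

```python
from collections import Counter


def admit_jobs(submissions, max_jobs_per_user):
    """Split (user, job_id) submissions into (accepted, rejected) under a per-user cap."""
    accepted_per_user = Counter()
    accepted, rejected = [], []
    for user, job_id in submissions:
        if accepted_per_user[user] < max_jobs_per_user:
            accepted_per_user[user] += 1
            accepted.append((user, job_id))
        else:
            # Over the cap: a runaway submitter cannot fill the whole queue.
            rejected.append((user, job_id))
    return accepted, rejected


# Hypothetical burst: one user floods the queue, another submits normally.
burst = [("dev", i) for i in range(5)] + [("local", 0)]
accepted, rejected = admit_jobs(burst, max_jobs_per_user=2)
print(len(accepted), len(rejected))
```

With a cap in place, a submission loop like the one caused by the JobSubmitter bug would be cut off at the limit instead of overloading the site.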

8.12~8.18 (data transfers normal -> test jobs normal -> many application errors)

  • 8.11~8.12 Some CMS production jobs ran in an infinite loop. They stopped only when the local disk space (more than 300GB) was used up. I reported this to CMS data operations with log files, and they have fixed the problem.
  • Weekly job summary:
(1) Test jobs: test job success rate 100%.
(2) Total jobs: 18,340, success rate 64%, application 13%, grid 23%.
(3) production: 59/100   Analysis: 5600/9450
(4) Main error causes: some recent simulation jobs went into an infinite loop; their logs gradually filled the local disk and affected other jobs.
  • Weekly transfer summary:
Transfers normal.
Status: 9.3TB in, 11TB out.
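
The runaway jobs in the 8.11~8.12 entry were noticed only once the local disk was full. A minimal sketch of an earlier warning, using Python's standard `shutil.disk_usage`, might look like the following; the 10% threshold and the helper names are assumptions for illustration, not what the site actually runs.

```python
import shutil


def free_fraction(path: str = ".") -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def should_alert(free: float, min_free: float = 0.10) -> bool:
    """True when the free fraction has dropped below the (hypothetical) threshold."""
    return free < min_free


if should_alert(free_fraction(".")):
    print("local disk nearly full: check for jobs writing runaway logs")
```

Run periodically on the worker nodes, a check like this could flag the growing log files before other jobs on the same node start to fail.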

8.19~8.25 (data transfers normal -> test jobs normal)

  • Weekly job summary:
(1) Test jobs: test job success rate 100%.
(2) Total jobs: 10,773, success rate 80%, application 14%, grid 6%.
(3) production: 472/578   Analysis: 1400/3000
(4) Main error causes: the grid errors were mainly due to individual nodes being down on the 17th and 19th. Application errors: FileOpenError, failed to copy files to SE, output files not found.
  • Weekly transfer summary:
Transfers normal.
Status: 5.4TB in, 5.02TB out.

8.26~9.1 (data transfers normal -> test jobs normal)

  • 8.25-8.30 SE and cce were briefly critical. This slightly affected the SAM tests, whose success rate was 96% on those two days.
  • Weekly job summary:
(1) Test jobs: test job success rate 100%.
(2) Total jobs: 17104, success rate 84%, application 11%, grid 5%.
(3) production: 245/5000   Analysis: 2100/4750
(4) Main error causes: the grid errors were concentrated on the 25th, 27th and 29th and include local batch problems and terminated problems. Application errors: FileOpenError, caused by user configuration problems.
  • Weekly transfer summary:
Transfers normal.
Status: 5.8TB in, 5TB out.

Topic revision: r18 - 2011-10-17 - ZhangXiaomei
 