The cmsd.log file on one of the supervisor nodes has these lines, and the last line repeats every 10 minutes.
…...
Config round robin scheduling in effect.
------ [log in to unmask] phase 2 supervisor initialization completed.
150401 11:18:26 12180 Start: Waiting for primary server to login.
------ cmsd [log in to unmask]:53216 initialization completed.
150401 11:18:28 12179 Inet: Accepted connection from 19@localhost
150401 11:18:28 12342 Admin_Login initial request: 'login p 12325 port 31094'
150401 11:18:28 12342 Update FrontEnd Parm1=1 Parm2=31094
150401 11:18:28 12342 do_Login:: Primary server 12325 logged in; data port is 31094
150401 11:18:28 12182 Pander supervisor services to cmsxrootd.hep.wisc.edu:1213
150401 11:18:28 12182 Pander trying to connect to lvl 0 cmsxrootd.hep.wisc.edu:1213
150401 11:18:28 12179 Protocol: redirector.12325:19@localhost logged in.
150401 11:18:28 12179 Admit_Redirector redirector.12325:19@localhost assigned slot 1
150401 11:18:28 12182 Add cmsxrootd.hep.wisc.edu to manager config; id=0
150401 11:18:28 12182 ManTree: Now connected to 1 root node(s)
150401 11:18:28 12182 Protocol: Logged into cmsxrootd
150401 11:18:36 12163 Update Stage Parm1=-1 Parm2=0
150401 11:18:36 12163 Update Active Parm1=-1 Parm2=0
150401 11:18:36 12163 Config: supervisor service enabled.
150401 11:18:36 12343 State: Status changed to suspended + nostaging
150401 11:18:36 12343 Inform cmsxrootd.hep.wisc.edu status
150401 11:25:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
150401 11:25:21 12182 Report_Usage cpu=0 net=0 xeq=0 mem=0 pag=0 dsk=0 0
150401 11:35:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
150401 11:35:21 12182 Report_Usage cpu=0 net=0 xeq=0 mem=0 pag=0 dsk=0 0
150401 11:45:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
…
The xrootd.log file on the same supervisor node has these lines:
150401 13:57:43 12509 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
150401 13:57:53 16998 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
150401 13:58:03 12351 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
150401 13:58:13 12329 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
150401 13:58:23 16998 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
On one of the redirectors, I see a bunch of messages in the cmsd.log file that seem to indicate a problem. In the end, cmsd crashed on this redirector.
150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers.
150401 14:02:38 26744 Add server.4824:83@g26n21 redirected; too many subscribers.
150401 14:02:38 26750 Add server.36654:84@g27n29 redirected; too many subscribers.
150401 14:02:38 26746 Add server.7037:83@g18n13 redirected; too many subscribers.
150401 14:02:38 26742 Add server.6425:84@g10n13 redirected; too many subscribers.
150401 14:02:38 26739 Add server.32081:86@g19n25 redirected; too many subscribers.
150401 14:02:38 26744 Add server.5727:83@g26n26 redirected; too many subscribers.
150401 14:02:38 26750 Add server.585:85@g12n05 redirected; too many subscribers.
150401 14:02:38 26746 Protocol: g14n27 has not yet found a cluster slot!
150401 14:02:38 26746 Add server.19666:83@g14n27 redirected; too many subscribers.
150401 14:02:38 26742 Add server.32464:84@g18n05 redirected; too many subscribers.
150401 14:02:38 26744 Add server.24132:83@g10n10 redirected; too many subscribers.
150401 14:02:38 26739 Add server.9156:84@g14n18 redirected; too many subscribers.
150401 14:02:38 26806 Add server.11709:81@g14n02 redirected; too many subscribers.
…..
150401 14:02:41 26405 XrdPoll: Unable to exclude link server.4297:82@g26n29; bad file descriptor
150401 14:02:41 26405 XrdPoll: Sever event occured for server.19044:76@g20n04
150401 14:02:41 26405 XrdPoll: Unable to exclude link server.19044:76@g20n04; bad file descriptor
150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
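In case it helps to gauge the size of the burst, here is a rough Python sketch for counting the "too many subscribers" lines per second and collecting the hosts involved. It is not part of xrootd; the function name count_redirected and the regular expression are my own assumptions, based only on the line layout shown above, and it has only been tried on lines of that shape.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# 150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers.
REDIRECTED = re.compile(
    r"^(\d{6}) (\d{2}:\d{2}:\d{2}) \d+ Add (\S+)@(\S+) redirected; too many subscribers\.$"
)

def count_redirected(lines):
    """Return (per-second counts, set of hosts) for 'too many subscribers' lines."""
    per_second = Counter()
    hosts = set()
    for line in lines:
        m = REDIRECTED.match(line.strip())
        if m:
            date, time, _ident, host = m.groups()
            per_second[(date, time)] += 1
            hosts.add(host)
    return per_second, hosts

# Three lines copied from the redirector log above; the middle one should be ignored.
sample = [
    "150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers.",
    "150401 14:02:38 26746 Protocol: g14n27 has not yet found a cluster slot!",
    "150401 14:02:38 26746 Add server.19666:83@g14n27 redirected; too many subscribers.",
]
per_second, hosts = count_redirected(sample)
print(per_second, sorted(hosts))
```

Running it over the whole cmsd.log (for example with the lines read from the file instead of the sample list) would show how many servers were turned away in each second before the crash.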
Let me know if you need more information.
Thanks for any help.
-Tapas