Hi Tapas,

Two questions:

1. Does your cluster have more than 64 nodes (I mean data servers + supervisors)?
2. It seems to me that the supervisor can’t find any data servers to manage. Are you using IPv6? If you are still on IPv4, can you add -I v4 to the command line in cmsd and xrootd? (see /etc/sysconfig/xrootd)

Note (probably for myself) that 2 may be true if 1 is also true.

regards,
Wei Yang | [log in to unmask] | 650-926-3338(O)

On Apr 3, 2015, at 9:45 AM, Tapas Sarangi <[log in to unmask]> wrote:

> Hi Wei,
>
> Thanks for your reply. Here are the FQDNs of one data server, one manager and one supervisor:
>
> server : g26n03.hep.wisc.edu
> manager : cmsxrootd.hep.wisc.edu
> supervisor : s15n01.hep.wisc.edu
>
> The xrootd config file is the same for all of these, and the roles are defined inside the config file. Both cmsd and xrootd depend on the same ‘/etc/xrootd/xrootd.cfg’ file. Please find it in the attachment.
>
> -Tapas
> (T2_Wisconsin admin)
>
>
> <xrootd.cfg>
>
>
>> On Apr 3, 2015, at 11:06 AM, Yang, Wei <[log in to unmask]> wrote:
>>
>> Hi Tapas,
>>
>> A number of us are on vacation, including me. I will see if I can find time today or this weekend to look at it.
>>
>> In the meantime, can you give us
>>
>> 1. a short list of hostnames containing one manager, one supervisor and one data server.
>> 2. attach xrootd configuration files from each of the above nodes?
>>
>> regards,
>> Wei Yang | [log in to unmask] | 650-926-3338(O)
>>
>>
>>
>> On Apr 3, 2015, at 8:04 AM, Tapas Sarangi <[log in to unmask]> wrote:
>>
>>> Hello,
>>>
>>> Any help on this?
>>>
>>> Thanks
>>> -Tapas
>>>
>>>
>>>
>>>> On Apr 1, 2015, at 2:35 PM, Tapas Sarangi <[log in to unmask]> wrote:
>>>>
>>>> Some more info from the cmsd.log file on one of the servers. I see these repeated messages followed by IPv6 addresses of all the supervisor nodes.
>>>>
>>>> 150401 14:30:16 30214 Remove completed cmsxrootd.hep.wisc.edu manager 0.257
>>>> 150401 14:30:16 30214 Manager: manager.0:21@cmsxrootd removed; redirected
>>>> 150401 14:30:16 30214 Pander trying to connect to lvl 1 [:46222
>>>> 150401 14:30:16 30214 XrdOpen: Unable to create socket for ' [ '; invalid IPv6 address
>>>> 150401 14:30:19 30214 Pander trying to connect to lvl 1 [2607:46222
>>>> 150401 14:30:25 30214 Pander trying to connect to lvl 0 cmsxrootd.hep.wisc.edu:1213
>>>> 150401 14:30:25 30214 Add cmsxrootd.hep.wisc.edu to manager config; id=0
>>>> 150401 14:30:25 30214 ManTree: Now connected to 1 root node(s)
>>>> 150401 14:30:25 30214 Protocol: Logged into cmsxrootd
>>>> 150401 14:30:25 30214 Dispatch manager.0:21@cmsxrootd for try dlen=3587
>>>> 150401 14:30:25 30214 manager.0:21@cmsxrootd do_Try:
>>>>
>>>>
>>>> Appreciate your help.
>>>>
>>>> -Tapas
>>>>
>>>>
>>>>> On Apr 1, 2015, at 2:12 PM, Tapas Sarangi <[log in to unmask]> wrote:
>>>>>
>>>>> Dear Xrootd Developers,
>>>>>
>>>>> After upgrading to the xrootd-4.1 OSG32 packaging, we see several problems, mostly supervisor related. First of all, the supervisor nodes have very inactive log files and almost no xrootd traffic.
>>>>>
>>>>> The cmsd.log file on one of the supervisor nodes has these lines, and the last line repeats every 10 minutes.
>>>>>
>>>>> …...
>>>>> Config round robin scheduling in effect.
>>>>> ------ [log in to unmask] phase 2 supervisor initialization completed.
>>>>> 150401 11:18:26 12180 Start: Waiting for primary server to login.
>>>>> ------ cmsd [log in to unmask]:53216 initialization completed.
>>>>> 150401 11:18:28 12179 Inet: Accepted connection from 19@localhost
>>>>> 150401 11:18:28 12342 Admin_Login initial request: 'login p 12325 port 31094'
>>>>> 150401 11:18:28 12342 Update FrontEnd Parm1=1 Parm2=31094
>>>>> 150401 11:18:28 12342 do_Login:: Primary server 12325 logged in; data port is 31094
>>>>> 150401 11:18:28 12182 Pander supervisor services to cmsxrootd.hep.wisc.edu:1213
>>>>> 150401 11:18:28 12182 Pander trying to connect to lvl 0 cmsxrootd.hep.wisc.edu:1213
>>>>> 150401 11:18:28 12179 Protocol: redirector.12325:19@localhost logged in.
>>>>> 150401 11:18:28 12179 Admit_Redirector redirector.12325:19@localhost assigned slot 1
>>>>> 150401 11:18:28 12182 Add cmsxrootd.hep.wisc.edu to manager config; id=0
>>>>> 150401 11:18:28 12182 ManTree: Now connected to 1 root node(s)
>>>>> 150401 11:18:28 12182 Protocol: Logged into cmsxrootd
>>>>> 150401 11:18:36 12163 Update Stage Parm1=-1 Parm2=0
>>>>> 150401 11:18:36 12163 Update Active Parm1=-1 Parm2=0
>>>>> 150401 11:18:36 12163 Config: supervisor service enabled.
>>>>> 150401 11:18:36 12343 State: Status changed to suspended + nostaging
>>>>> 150401 11:18:36 12343 Inform cmsxrootd.hep.wisc.edu status
>>>>> 150401 11:25:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
>>>>> 150401 11:25:21 12182 Report_Usage cpu=0 net=0 xeq=0 mem=0 pag=0 dsk=0 0
>>>>> 150401 11:35:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
>>>>> 150401 11:35:21 12182 Report_Usage cpu=0 net=0 xeq=0 mem=0 pag=0 dsk=0 0
>>>>> 150401 11:45:21 12182 Dispatch manager.0:24@cmsxrootd for usage dlen=0
>>>>> …….
>>>>>
>>>>>
>>>>> The xrootd.log file on the same supervisor node has these lines:
>>>>>
>>>>> 150401 13:57:43 12509 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
>>>>> 150401 13:57:53 16998 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
>>>>> 150401 13:58:03 12351 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
>>>>> 150401 13:58:13 12329 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
>>>>> 150401 13:58:23 16998 cmsprod.1364:7@cron01 XrootdProtocol: stalling client for 10 sec
>>>>>
>>>>>
>>>>>
>>>>> On one of the redirectors, I see a bunch of messages in the cmsd.log file that seem to indicate a problem. In the end, cmsd crashed on this redirector.
>>>>>
>>>>> 150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers.
>>>>> 150401 14:02:38 26744 Add server.4824:83@g26n21 redirected; too many subscribers.
>>>>> 150401 14:02:38 26750 Add server.36654:84@g27n29 redirected; too many subscribers.
>>>>> 150401 14:02:38 26746 Add server.7037:83@g18n13 redirected; too many subscribers.
>>>>> 150401 14:02:38 26742 Add server.6425:84@g10n13 redirected; too many subscribers.
>>>>> 150401 14:02:38 26739 Add server.32081:86@g19n25 redirected; too many subscribers.
>>>>> 150401 14:02:38 26744 Add server.5727:83@g26n26 redirected; too many subscribers.
>>>>> 150401 14:02:38 26750 Add server.585:85@g12n05 redirected; too many subscribers.
>>>>> 150401 14:02:38 26746 Protocol: g14n27 has not yet found a cluster slot!
>>>>> 150401 14:02:38 26746 Add server.19666:83@g14n27 redirected; too many subscribers.
>>>>> 150401 14:02:38 26742 Add server.32464:84@g18n05 redirected; too many subscribers.
>>>>> 150401 14:02:38 26744 Add server.24132:83@g10n10 redirected; too many subscribers.
>>>>> 150401 14:02:38 26739 Add server.9156:84@g14n18 redirected; too many subscribers.
>>>>> 150401 14:02:38 26806 Add server.11709:81@g14n02 redirected; too many subscribers.
>>>>>
>>>>> …..
>>>>>
>>>>> 150401 14:02:41 26405 XrdPoll: Unable to exclude link server.4297:82@g26n29; bad file descriptor
>>>>> 150401 14:02:41 26405 XrdPoll: Sever event occured for server.19044:76@g20n04
>>>>> 150401 14:02:41 26405 XrdPoll: Unable to exclude link server.19044:76@g20n04; bad file descriptor
>>>>> 150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
>>>>> 150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
>>>>> 150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
>>>>> 150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
>>>>> 150401 14:02:41 26406 XrdPoll: Sever event occured for server.7993:72@g27n12
>>>>> 150401 14:02:41 26406 XrdPoll: Unable to exclude link server.7993:72@g27n12; bad file descriptor
>>>>>
>>>>>
>>>>> Let me know if you need more information.
>>>>>
>>>>> Thanks for any help.
>>>>> -Tapas
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>>
>>
>
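
One rough way to approach the "more than 64 nodes" question from the redirector side is to count how many distinct data servers were turned away with "too many subscribers" in cmsd.log. The following is only a minimal Python sketch and is not part of xrootd. The function name `count_redirected_servers` and the regular expression are assumptions, modeled solely on the log lines quoted in this thread rather than on any documented log grammar.

```python
import re

# Assumed pattern, modeled on quoted cmsd.log lines such as:
# "150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers."
REDIRECTED = re.compile(
    r"Add server\.\d+:\d+@(\S+) redirected; too many subscribers"
)


def count_redirected_servers(lines):
    """Return the set of host names redirected for lack of a free slot."""
    hosts = set()
    for line in lines:
        match = REDIRECTED.search(line)
        if match:
            hosts.add(match.group(1))  # host part after the '@'
    return hosts


if __name__ == "__main__":
    # Sample lines copied from the redirector log quoted in this thread.
    sample = [
        "150401 14:02:38 26739 Add server.17763:84@g18n08 redirected; too many subscribers.",
        "150401 14:02:38 26746 Protocol: g14n27 has not yet found a cluster slot!",
        "150401 14:02:38 26744 Add server.4824:83@g26n21 redirected; too many subscribers.",
    ]
    hosts = count_redirected_servers(sample)
    print(len(hosts), sorted(hosts))
```

To run it against a real log, pass an open file object to the function, for example `count_redirected_servers(open(path))`, where `path` is wherever cmsd.log lives on the redirector. A large number of distinct hosts would be consistent with data servers failing to find a supervisor to subscribe to, but it does not prove that by itself.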