Hi Tommaso,

Indeed, that was my thinking originally but I thought I'd follow the 
conservative path. What this means is that these WN jobs, while connecting 
to the machine xrootd is running on, are using the cmsd port number (I 
assume you are using 1094 for xrootd and something else for cmsd). That 
would indicate a job configuration error of some kind. For instance, some 
jobs are getting the right host but the wrong port set for them.
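
A rough way to spot such a misconfiguration is to check which port each job's root:// URL points at. The sketch below is only an illustration: the function name, the host name, and the cmsd port value (1213 in the usage note) are placeholders rather than values taken from this thread; only the xrootd port 1094 comes from the discussion above.

```python
from urllib.parse import urlsplit

XROOTD_PORT = 1094  # xrootd port assumed in the message above


def classify_port(url, cmsd_port):
    """Report which service a root:// URL would reach on the redirector host.

    cmsd_port is site specific and must be supplied by the caller.
    """
    # A URL without an explicit port falls back to the xrootd default.
    port = urlsplit(url).port or XROOTD_PORT
    if port == cmsd_port:
        return "cmsd port"
    if port == XROOTD_PORT:
        return "xrootd port"
    return "other port"
```

For example, with a hypothetical cmsd port of 1213, `classify_port("root://redir.example.org:1213//store/f.root", 1213)` flags the URL as pointing at the cmsd port, which is the situation described above.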

One thing to look at (and send to me) is the corresponding xrootd log file 
(i.e. the same time period when the cmsd reports the error). Perhaps that 
will tell us what is going on.
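
To find the time period to pull from the xrootd log, the failed logins in cmsd.log can be tallied per host. This is a minimal sketch, assuming each log entry sits on a single line in the format quoted further down in this thread (the wrapping seen there comes from the mail client); the function name is made up for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches e.g. "160219 08:53:38 16932 Login: host login failed; invalid login data"
LOGIN_FAIL = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) \d+ Login: (\S+) login failed; invalid login data"
)


def summarize_failures(lines):
    """Return (failures per host, (first, last) timestamp) for cmsd.log lines."""
    hosts = Counter()
    stamps = []
    for line in lines:
        match = LOGIN_FAIL.match(line)
        if match:
            # Timestamps are written as YYMMDD HH:MM:SS.
            stamps.append(datetime.strptime(match.group(1), "%y%m%d %H:%M:%S"))
            hosts[match.group(2)] += 1
    window = (min(stamps), max(stamps)) if stamps else None
    return hosts, window
```

Calling `summarize_failures(open("cmsd.log"))` gives the hosts that fail most often and the first and last failure times, which is the window to extract from the xrootd log.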

Andy

On Fri, 19 Feb 2016, Tommaso Boccali wrote:

> Just got confirmation that these are WNs, and they do not have Xrootd-server
> installed at all.
> As for the client version, it depends on the CMSSW externals, which
> means it is the same at all sites, so seeing the "invalid login"
> for only a fraction of them is really unexplained....
>
> tom
>
> On Fri, Feb 19, 2016 at 9:10 AM, Tommaso Boccali <[log in to unmask]>
> wrote:
>
>> ciao andrew, are you sure these are connections from servers and not from
>> clients? I ask since their names are suspicious:
>>
>> 160219 08:53:38 16932 Pup: buffer overrun unpacking short arg 0: ident.
>> 160219 08:53:38 16932 Login: sbgwn20.in2p3.fr login failed; invalid login data
>>
>> 160219 08:57:02 16934 Pup: buffer overrun unpacking short arg 0: ident.
>> 160219 08:57:02 16934 Login: gw1.cis.gov.pl login failed; invalid login data
>>
>> 160219 08:45:08 64858 Pup: buffer overrun unpacking short arg 0: ident.
>> 160219 08:45:08 64858 Login: sbgwn12.in2p3.fr login failed; invalid login data
>>
>> ...
>> 160219 08:56:34 24733 Login: g28n03.hep.wisc.edu login failed; invalid login data
>> ...
>>
>> they look like WN names, not server names ....
>>
>> still, I am asking them
>>
>> tom
>>
>> On Thu, Feb 18, 2016 at 10:25 PM, Andrew Hanushevsky <[log in to unmask]>
>> wrote:
>>
>>> Hi Tommaso,
>>>
>>> Yes, you should care. These sites are not joining your cluster as they
>>> cannot log in. Could you tell me what version the sites that are getting the
>>> errors are running?
>>>
>>> Andy
>>>
>>> *From:* Tommaso Boccali <[log in to unmask]>
>>> *Sent:* Thursday, February 18, 2016 8:12 AM
>>> *To:* Andrew Hanushevsky <[log in to unmask]>
>>> *Cc:* Gerard Bernabeu <[log in to unmask]> ; [log in to unmask] ; Marian
>>> Zvada <[log in to unmask]> ; Jan Iven <[log in to unmask]>
>>> *Subject:* Re: problem in transitioning a redirector from 3.3.6 to 4.2.2
>>>
>>> ciao, coming back to this, a few months later (on 4.2.3)
>>>
>>> i still see TONS of
>>>
>>> 160218 17:08:17 42171 Pup: buffer overrun unpacking short arg 0: ident.
>>> 160218 17:08:17 42171 Login: gridlink.hephy.oeaw.ac.at login failed; invalid login data
>>> ...
>>> 160218 17:07:52 42118 Login: grid-wn080.physik.rwth-aachen.de login failed; invalid login data
>>> ...
>>> 160218 17:04:06 40163 Login: fw-nat-inside-outside.gridka.de login failed; invalid login data
>>> ...
>>> 160218 16:53:56 25501 Login: wna033.jinr-t1.ru login failed; invalid login data
>>>
>>>
>>> in cmsd.log
>>>
>>> not sure it has any bad effect ... but: should we care?
>>>
>>> this is at least 1 Hz, and comes from multiple sites ....
>>>
>>>
>>> tom
>>>
>>> On Fri, Sep 4, 2015 at 7:55 AM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>
>>>> Hi Tommaso,
>>>>
>>>> You mentioned that the fnal.gov addresses are worker nodes. Why are they
>>>> connecting to the cmsd?
>>>>
>>>> Andy
>>>>
>>>>
>>>> On Fri, 4 Sep 2015, Tommaso Boccali wrote:
>>>>
>>>>> By the way, yesterday I upgraded the eu redir to 4.2.3. It seems to work fine,
>>>>> even if there is less than one day of statistics for the moment....
>>>>>
>>>>> Tom
>>>>> On 04 Sep 2015 01:33 AM, "Gerard Bernabeu" <[log in to unmask]>
>>>>> wrote:
>>>>>
>>>>> the fnal.gov address is from a WorkerNode (probably running a CMS job).
>>>>>>
>>>>>> Gerard
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 4:54 PM, Andrew Hanushevsky <
>>>>>> [log in to unmask]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Tommaso,
>>>>>>>
>>>>>>> What are fw-nat-inside-outside.gridka.de and cmswn2148.fnal.gov? The
>>>>>>> message clearly shows that whatever they sent over was incorrect. Yes,
>>>>>>> 4.2.2 would crash in this case, sigh.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Wed, 26 Aug 2015, Tommaso Boccali wrote:
>>>>>>>
>>>>>>> ciao, another piece of info which might be interesting:
>>>>>>>>
>>>>>>>> I was looking into the bari eu redir, which uses xrootd
>>>>>>>>
>>>>>>>> xrootd-4.1.1-1.el5
>>>>>>>>
>>>>>>>> the cmsd.log has TONS of messages like
>>>>>>>>
>>>>>>>> 150826 05:18:00 30442 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>> 150826 05:18:00 30442 ?:[log in to unmask] XrdPoll: FD 90 attached to poller 0; num=23
>>>>>>>> 150826 05:18:00 30442 Pup: buffer overrun unpacking short arg 0: ident.
>>>>>>>> 150826 05:18:00 30442 Login: fw-nat-inside-outside.gridka.de login failed; invalid login data
>>>>>>>> 150826 05:18:00 30442 ?:[log in to unmask] XrdPoll: FD 90 detached from poller 0; num=22
>>>>>>>>
>>>>>>>> from many servers, most from FNAL
>>>>>>>>
>>>>>>>> 150826 21:41:28 3396 Login: cmswn2148.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:28 3436 Login: cmswn2146.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:35 3461 Login: cmswn2131.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:36 2475 Login: cmswn2158.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:40 3461 Login: cmswn2150.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:45 3458 Login: cmswn2160.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:47 3396 Login: cmswn2131.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:50 3461 Login: cmswn2140.fnal.gov login failed; invalid login data
>>>>>>>> 150826 21:41:56 3458 Login: cmswn2147.fnal.gov login failed; invalid login data
>>>>>>>>
>>>>>>>> apparently, we did not notice since 4.1.1-1 does not crash like 4.2.2 does,
>>>>>>>> but moves along ...
>>>>>>>>
>>>>>>>> tom
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2015 at 9:07 PM, Marian Zvada <[log in to unmask]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>> On 8/25/15 11:58 AM, Tommaso Boccali wrote:
>>>>>>>>>
>>>>>>>>>> Well, but: isn't the global redir only subscribed by regional redirs
>>>>>>>>>> (so not many)?
>>>>>>>>>
>>>>>>>>> you're right, I neglected this fact (outsmarted myself ;))...
>>>>>>>>>
>>>>>>>>>> Probably eu redirs are the most connected, with close to 64 cmsd
>>>>>>>>>> entering... It's just normal we saw the problem there.
>>>>>>>>>
>>>>>>>>> ok, this is alarming and we should revise the current setup and
>>>>>>>>> introduce more redirectors if needed in EU. Btw, I recently talked
>>>>>>>>> with Andy about this - it looks like a much more promising way to
>>>>>>>>> handle the 64 limit - to think about supervisors:
>>>>>>>>>
>>>>>>>>> http://xrootd.org/doc/dev42/cms_config.htm#_Toc405927050
>>>>>>>>>
>>>>>>>>> I'm going to do this in the transitional federation, where there is one
>>>>>>>>> global redirector for all T3s and then those subscribers who will be
>>>>>>>>> kicked off from the production federation and subscribed there instead.
>>>>>>>>>
>>>>>>>>> -Marian
>>>>>>>>>
>>>>>>>>>> Ifca said it has 336-1, which is fairly common. I guess it cannot be
>>>>>>>>>> due to (just) the release....
>>>>>>>>>>
>>>>>>>>>> Andy, did you understand the source of the bad login data? Is it
>>>>>>>>>> worth trying and debugging it?
>>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>>
>>>>>>>>>> On 25 Aug 2015 06:21 PM, "Jan Iven" <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>>     On 08/25/2015 05:56 PM, Marian Zvada wrote:
>>>>>>>>>>
>>>>>>>>>>         Hi Tom,
>>>>>>>>>>
>>>>>>>>>>     [..]
>>>>>>>>>>
>>>>>>>>>>         yeah, that is my guess too, but then we have global
>>>>>>>>>>         redirectors at CERN running 4.2.2 dealing with hell lot of
>>>>>>>>>>         cmsd subscriptions so I'd expect some visible trouble there
>>>>>>>>>>         as well. So maybe we're lucky there too so far... (I believe
>>>>>>>>>>         that autorestart of cmsd if it crashes is disabled there,
>>>>>>>>>>         Jan?)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     No, the CMS global redirectors are on CC7, and will auto-restart
>>>>>>>>>>     cmsd on "unclean" exit (Restart=on-abort).  I hope that SEGV
>>>>>>>>>>     counts as such...
>>>>>>>>>>
>>>>>>>>>>     Not sure whether we'd even notice the occasional restart, unless
>>>>>>>>>>     another tool (abrt) picks this up.
>>>>>>>>>>
>>>>>>>>>>     Cheers
>>>>>>>>>>     jan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Tommaso Boccali
>>>>>>>> INFN Pisa
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ########################################################################
>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>
>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Gerard Bernabeu Altayó*
>>>>>> Deputy Department Head
>>>>>>
>>>>>> Distributed Computing Services Operations
>>>>>> Fermi National Accelerator Laboratory
>>>>>> 630 840 6509 office
>>>>>> www.fnal.gov
>>>>>>
>>>>>>
>>>
>>>
>>> --
>>> Tommaso Boccali
>>> INFN Pisa
>>>
>>
>>
>>
>> --
>> Tommaso Boccali
>> INFN Pisa
>>
>
>
>
> -- 
> Tommaso Boccali
> INFN Pisa
>
