Hi Andy,

Thanks very much for convincing us that the traffic is not 
xrootd-related.  On closer inspection, we found that this particular node 
was misconfigured to read the underlying Lustre FS over the public 
interface, which is why the inbound traffic scaled so perfectly with the 
outbound xrootd traffic.

All the best!

Kyle

On 06/24/2011 04:07 PM, Andrew Hanushevsky wrote:
> Hi Kyle,
>
> The summary counts are consistent. A read request is 24 bytes in length.
> So, we find
>
> 37184778462/1548943168 = 24.007 (as expected)
>
> The average payload in response:
>
> 26610785518162/1548943168 = 17179 (which looks pretty good)
>
> However, this is misleading. Looking at the sample of recvfrom's we find:
>
>           File#    Offset           Length
> 0600 0bc5 00000000 0000000000d74e00 0008a400 = 566272 bytes
> 1f00 0bc5 00000000 0000000000dff200 00003e00 =  15872 bytes
> 0500 0bc5 00000000 0000000000e03000 00004400 =  17408 bytes
> 0500 0bc5 00000000 0000000000e07400 00001400 =   5120 bytes
> 1f00 0bc5 00000000 0000000000e08800 00001400 =   5120 bytes
> 0600 0bc5 00000000 0000000000e09c00 00001400 =   5120 bytes
>
> So, you likely do a lot of small reads with the very large reads pushing
> up the average. That said, I would still suspect the monitoring at this
> point.
>
> Andy
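
To make the point above about the average concrete, here is a minimal Python sketch that decodes the six sampled 24-byte requests and compares the mean and median read length. The field layout is an assumption read off Andy's table (file handle, offset, length); the names stream_id, request_id and dlen for the remaining fields are my own labels, not taken from this thread.

```python
import struct
from statistics import mean, median

# The six 24-byte payloads from the recvfrom() sample, written as hex.
samples = [
    "06000bc5 00000000 0000000000d74e00 0008a400 00000000",
    "1f000bc5 00000000 0000000000dff200 00003e00 00000000",
    "05000bc5 00000000 0000000000e03000 00004400 00000000",
    "05000bc5 00000000 0000000000e07400 00001400 00000000",
    "1f000bc5 00000000 0000000000e08800 00001400 00000000",
    "06000bc5 00000000 0000000000e09c00 00001400 00000000",
]

# Assumed layout, big-endian, 24 bytes in total:
#   2-byte stream id, 2-byte request code, 4-byte file handle,
#   8-byte offset, 4-byte read length, 4-byte trailing length field.
HEADER = struct.Struct(">2sH4sqii")

def decode(hex_text):
    """Unpack one request into a dict of named fields."""
    stream_id, request_id, fhandle, offset, rlen, dlen = HEADER.unpack(
        bytes.fromhex(hex_text)
    )
    return {"request_id": request_id, "offset": offset, "rlen": rlen, "dlen": dlen}

requests = [decode(s) for s in samples]
lengths = [r["rlen"] for r in requests]

print("read lengths:", lengths)
print("mean   :", mean(lengths))
print("median :", median(lengths))
```

With only six samples this is illustrative rather than conclusive, but it shows how a single 566272-byte read pulls the mean far above the typical request size.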
>
> On Fri, 24 Jun 2011, Kyle Fransham wrote:
>
>> Hi Andy,
>>
>> On 06/23/2011 04:28 PM, Andrew Hanushevsky wrote:
>>> Hi Kyle,
>>>
>>> What we need to do is find out what the inbound traffic really is. So, some
>>> more questions:
>>>
>>> 1) On your graph, is that bytes in/out or packets in/out? If it is packets,
>>> then the graph is likely correct and you are simply doing a lot of very
>>> small reads.
>> That's definitely bytes, not packets on the graph.
>>> 2) To find out a bit more statistics you can connect to the xrootd server
>>> using the xrd command. Do the following:
>>>
>>> xrd <the_xrootd_server>
>>> query 1 lp
>>> exit
>>>
>>> Then find the values between <in></in> and <out></out>; these give you the
>>> number of bytes in and the number of bytes out. We need to see if that is
>>> reasonable compared to the actual requests, which you will find between
>>> <rd></rd> and <wr></wr> (read/write counters).
>> <rd>1548943168</rd>
>> <wr>0</wr>
>> <in>37184778462</in>
>> <out>26610785518162</out>
>>
>> So writes are 0, as expected.  Inbound traffic is three orders of magnitude
>> less than outbound, which doesn't correspond to our monitoring, but looks
>> good.
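
As a quick sanity check on the four counters above, the following Python sketch pulls the values out of the tagged text and computes bytes per read request in each direction. The one-line summary string is assembled by hand from the values quoted here; the real query output contains more fields, and the helper name counters is hypothetical.

```python
import re

# Counter values copied from the query output quoted above.
summary = (
    "<rd>1548943168</rd><wr>0</wr>"
    "<in>37184778462</in><out>26610785518162</out>"
)

def counters(text):
    """Return {tag: int} for every <tag>digits</tag> pair found in text."""
    return {tag: int(value) for tag, value in re.findall(r"<(\w+)>(\d+)</\1>", text)}

c = counters(summary)

bytes_in_per_read = c["in"] / c["rd"]    # inbound bytes per read request
bytes_out_per_read = c["out"] / c["rd"]  # outbound bytes per read request
in_out_ratio = c["in"] / c["out"]        # inbound as a fraction of outbound

print(f"in/rd  = {bytes_in_per_read:.3f} bytes")
print(f"out/rd = {bytes_out_per_read:.0f} bytes")
print(f"in/out = {in_out_ratio:.5f}")
```

If the server's own counters put inbound traffic at a small fraction of a percent of outbound while the network graph shows about 75%, the extra inbound bytes are not being counted by xrootd.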
>>
>>> 3) If those still seem not to correspond then we can look at the actual
>>> xrootd kernel calls using strace. For instance:
>>>
>>> strace -f -xx -ttt -p <pid> -e trace=network 2>&1 | grep 'recv(' > <outfile>
>> Okay, so I see no calls to 'recv(' in the strace.  However, I do see calls to
>> 'recvfrom('  that look like this:
>>
>> [pid 16211] 1308923304.005638 recvfrom(19,
>> "\x06\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xd7\x4e\x00\x00\x08\xa4\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
>> [pid 16211] 1308923304.045490 recvfrom(19,
>> "\x1f\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xdf\xf2\x00\x00\x00\x3e\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
>> [pid 16211] 1308923304.046110 recvfrom(19,
>> "\x05\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x30\x00\x00\x00\x44\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
>> [pid 16211] 1308923304.999048 recvfrom(19,
>> "\x05\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x74\x00\x00\x00\x14\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
>> [pid 16211] 1308923304.999260 recvfrom(19,
>> "\x1f\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x88\x00\x00\x00\x14\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
>> [pid 16211] 1308923305.000027 recvfrom(19,
>> "\x06\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x9c\x00\x00\x00\x14\x00\x00\x00\x00\x00",
>> 24, 0, NULL, NULL) = 24
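
For anyone who wants to process a longer capture, here is a rough Python sketch that turns one recvfrom line of this form into raw bytes. It assumes each call sits on a single line, as strace writes it (the mail client wrapped the lines above), and that -xx was used so every byte appears as a \xNN escape. The function name payload is hypothetical.

```python
import re

# One recvfrom line from the trace above, unwrapped onto a single logical line.
line = (
    r'[pid 16211] 1308923304.005638 recvfrom(19, '
    r'"\x06\x00\x0b\xc5\x00\x00\x00\x00\x00\x00\x00\x00'
    r'\x00\xd7\x4e\x00\x00\x08\xa4\x00\x00\x00\x00\x00", '
    r'24, 0, NULL, NULL) = 24'
)

# Matches the quoted buffer argument when it consists only of \xNN escapes.
RECVFROM = re.compile(r'recvfrom\(\d+, "((?:\\x[0-9a-f]{2})*)"')

def payload(trace_line):
    """Return the received buffer as bytes, or None if the line does not match."""
    m = RECVFROM.search(trace_line)
    if m is None:
        return None
    return bytes.fromhex(m.group(1).replace("\\x", ""))

data = payload(line)
# Bytes 16..19 hold the big-endian length column from Andy's table.
read_length = int.from_bytes(data[16:20], "big")
print(len(data), read_length)
```

Feeding every matching line of the output file through payload() would give the full distribution of requested read sizes rather than a six-line sample.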
>>
>>> This will capture server recv() requests (inbound traffic). No need to run
>>> this more than a minute or two.
>>>
>>> 4) If that doesn't reveal anything then the only other option is that there
>>> really is something else on that machine that is accepting incoming
>>> traffic.
>>> If it isn't udp then netstat should show you who that might be.
>> At first glance, netstat doesn't show any real flags.  Also, xrootd really
>> is the only thing running on this machine, and we see lots of input whenever
>> a user runs jobs (i.e. reads data from xrootd) and no input otherwise.  So
>> the two are at least correlated...
>>
>> Thanks for your help!
>>
>> Kyle
>>> Andy
>>>
>>>
>>> -----Original Message-----
>>> From: Kyle Fransham
>>> Sent: Thursday, June 23, 2011 1:05 PM
>>> To: Andrew Hanushevsky
>>> Cc: xrootd-l
>>> Subject: Re: inbound traffic
>>>
>>> Hi Andy,
>>>
>>> This is a machine at UVic that we use to serve BaBar xrootd files to
>>> virtual machines that we spawn in the cloud.  It's running little else
>>> besides xrootd.  Any traffic on the external interface (the plot that I
>>> sent) is xrootd.  We see very high inbound traffic almost all of the time.
>>>
>>> On the back end, we have 10TB or so of data in a lustre filesystem
>>> that's distributed across multiple workers.  Since this is a distributed
>>> filesystem, we expect (and we do see) traffic on the internal interface
>>> that's associated with the reading of xrootd collections.  But we don't
>>> expect to see that externally...
>>>
>>> What else can I tell you about this machine/setup to help diagnose the
>>> problem?
>>>
>>> Thanks,
>>>
>>> Kyle
>>>
>>> On 06/23/2011 03:36 PM, Andrew Hanushevsky wrote:
>>>> Hi Kyle,
>>>>
>>>> There should be little inbound traffic unless that machine is used for more
>>>> than just xrootd services. What machine are we talking about?
>>>>
>>>> Andy
>>>>
>>>> -----Original Message-----
>>>> From: Kyle Fransham
>>>> Sent: Thursday, June 23, 2011 7:48 AM
>>>> To: xrootd-l
>>>> Subject: inbound traffic
>>>>
>>>> Hi all,
>>>>
>>>> We've got a single xrootd server serving out BaBar root files over the
>>>> WAN.  We notice that there is a lot of inbound traffic, even though our
>>>> files are exported read-only.  Attached is a network plot showing the
>>>> traffic on the xrootd interface for four simultaneous user analysis
>>>> jobs.  (In case you can't see the attachment, the inbound traffic tends
>>>> to be about 75% of the outbound traffic.)
>>>>
>>>> Is this expected behaviour?
>>>>
>>>> Thanks,
>>>>
>>>> Kyle
>>>>