Hi Charles,

2011/5/15 Charles G Waldman <[log in to unmask]>:
> #0  0xffffe430 in __kernel_vsyscall ()
> #1  0xf7712ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> #2  0xef23d426 in XrdSysCondVar::Wait(int) ()   from .../libNetx.so
> #3  0xef19af48 in XrdClient::Read (this=0x16a44840, buf=0x1d70d600,offset=247982068, len=489) at XrdClient.cc:618
> #4  0xef2236d4 in TXNetFile::ReadBuffer(char*, int) ()  from .../libNetx.so
> #5  0xf586f623 in TBasket::ReadBasketBuffers(long long, int, TFile*)()from ../libTree.so
> #6  0xf58783fc in TBranch::GetBasket(int) () from .../libTree.so
>
>
> First off, note that the "XrdClient::Read" is only requesting 489 bytes in this read -
> no read-ahead or caching seems to be in effect.  As is well-known, many short reads
> like this will not perform well across the WAN.  So I am interested in testing the various
> read-ahead parameters, such as
>
>  XNet.ReadAheadSize
>  XNet.ReadAheadStrategy
>  XNet.ReadTrimBlockSize
>  XNet.ReadCacheSize
>
> Note that the system.rootrc for this job has:
>  XNet.ReadAheadSize: 0
>  XNet.ReadCacheSize: 10000000
>  XNet.ParStreamsPerPhyConn: 0

   The issue here is a bit complicated. The job that you are
analyzing is a ROOT job that uses the XRootD plugin to access data over
the network. In your stack trace the entry point to XRootD is
XrdClient::Read at frame #3: everything above it (frames #0-#2) is
XRootD, everything below it (frames #4-#6) is ROOT. The parameters that
you mention end up being forwarded to the XRootD layer. Now, all the
traffic optimization can be done in two places: in XRootD or in ROOT:

1) When the traffic optimization is supposed to happen in XRootD, you
would see things like an XrdClient::Read of len=489, because any
external application (ROOT in this case) just states what data it
needs, and the task of fetching it in the most efficient way is
entrusted to XRootD, which then optimizes the actual network reads.
The tweaking can be done with the parameters you mention. Since XRootD
is generic data access software, it has no knowledge of the underlying
file format, so all it can do is some statistical guesswork that may be
more or less optimal.
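
   For reference, these knobs go into the same system.rootrc (or your
own ~/.rootrc) as the settings you quoted. The values below are only
placeholders to show the format, not a recommendation; if I remember
correctly, both are in bytes:

  XNet.ReadAheadSize:  524288
  XNet.ReadCacheSize:  50000000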

2) When the traffic optimization is supposed to happen in ROOT, you
would most probably see xrootd being asked to perform vector reads
(XrdClient::ReadV) of around 30 megs. This is far more efficient
because ROOT knows its file format and can easily predict which parts
of the file it will need in the near future, so it is able to prefetch
the data before the application asks for it. The parameters here are
the ones you mention below: tree->SetCacheSize and friends.
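
   As a rough illustration, this is what enabling it looks like from
user code in a ROOT macro (the URL, tree name and cache size are just
placeholders):

  // open the remote file through the XRootD plugin
  TFile *f = TFile::Open("root://server//path/to/file.root");
  TTree *t = 0;
  f->GetObject("MyTree", t);        // "MyTree" is a placeholder name

  // let ROOT prefetch the baskets it knows it will need
  t->SetCacheSize(30*1024*1024);    // ~30 MB of cache
  t->AddBranchToCache("*");         // cache all branches

With this in place the small synchronous XrdClient::Read calls should
largely be replaced by big XrdClient::ReadV requests.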

> Question 1)
> Why is ReadAheadSize set to 0 here?  And what's the best way to override this?

   Most probably because it is assumed that the optimization will be
done in ROOT, so XRootD should not bother to do anything on its own.
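
   If you do want to override it, the simplest ways are an entry with
the same key in your own ~/.rootrc (which should take precedence over
system.rootrc), or setting the value through gEnv before the file is
opened. A minimal sketch, the value being just a placeholder:

  // has to run before TFile::Open, i.e. before TXNetFile is created
  gEnv->SetValue("XNet.ReadAheadSize", 524288);   // bytes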

>     We'd love to turn it on by default. But the read-ahead needs to put the data somewhere: we need some extra memory; 30 to 100MB are enough. But with the experiments' never ending quest for memory resources turning this on by default is not an option: a 4-core machine with 4 jobs would eat an additional 400MB. Too much.

   Well, if you turn on read-ahead or any prefetching you need to
store the additional data somewhere, and RAM is the easiest target. We
are working on some code that will hopefully be committed to ROOT soon
and that will enable it to prefetch the data blocks and store them on
disk instead.

>     Instead you need to turn it on by yourself by calling
>
>       tree->SetCacheSize(10000000);
>       tree->AddBranchToCache("*");
>
>
> I don't think we can force users to do this, is there somewhere else in the stack
> that this code could be inserted?

   Not really, since it depends on the user data that is being read.
You could argue that ROOT should enable some caching by default, but
that's debatable.

> Question 2)
>
> Despite the small read-size and 0 read-ahead there may be something else happening as
> well - note from the attached strace output (which covers a time-span of 90 seconds)
> that there are 8373 messages sent to fd 79 (the xroot server) and no replies received.  All of the
> sent messages are 24 bytes in length:
>
>  send(79, "\3\0\v\305\0\0\0\0\0\0\0\0\16\306R\0\0\0\4\0\0\0\0\0", 24, 0) = 24
>
> these are going out at a rate of about 90 per second.   It seems that the messages
> are being sent but no replies received - in the logs on the xroot server I see:
>
>  110515 11:47:32 11957 XrootdXeq: usatlas1.12016:[log in to unmask] login
>  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901
>  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901 rc=-3 (No such process)
>  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdResponse: sending err 3011: session not found
>
>
> Is this "session not found" the cause of the failures?
>
> Note that when this is happening, the server load is low (<1), the network is not
> congested, etc so there is no reason to be expecting timeouts.

   This is a request to end a session that apparently does not
exist... I will have a closer look. What is your access pattern? Do
you have long-standing jobs that keep the connections open for a long
time, or is it more like jobs that fetch the data they need and quit
after the processing is over?

> Question 3)
>
> I also see from strace output that the code is calling 'getrusage' excessively, does this
> really need to be checked 2500 times per second?
>
> Here's a count of system calls made in 90 seconds.
>  232452 getrusage
>  77484 gettimeofday
>  45394 time
>  16750 futex
>  15908 timer_settime
>   9131 clock_gettime
>   8373 send
>   8373 poll
>   1930 write
>   1137 lstat64
>    758 read
>    758 open
>    758 close
>     62 rt_sigaction
>     31 _llseek
>
> this seems quite inefficient!
>
> Note that the 758 instances of "read/open/close" are not receiving any data, they are just checking
> the process's memory usage:
>
>  open("/proc/self/statm", O_RDONLY|O_LARGEFILE) = 69
>  read(69, "223420 146018 5380 299 0 154771 "..., 46) = 34
>  close(69)                               = 0
>
> This is happening at a rate of ~10 Hz, which is not quite as bad as the ~2500Hz of "getrusage" calls.

   Are you sure that getrusage is called from XRootD and not from some
other place in your framework? I could not quite reproduce this issue
with xrdcp, which uses the same underlying API.

Cheers,
   Lukasz