(NB: Attachments to this message are posted at
http://www.mwt2.org/~cgw/xrd-logs/ . I apologize
for any problems caused by attempting to send email
with excessively large attachments).
Hi -
At ANALY_MWT2 site we have some worker nodes which read from xrootd
storage across a (10Gb) WAN link. Running ATLAS Hammercloud
stress-tests (test template 73 - muons) we see a cpu/walltime
efficiency of 75% for the local nodes vs 15% for the nodes reading
across WAN, and I'm trying to understand the reasons for this.
I attached to the running process and obtained a stack trace as well
as 'strace' output, which is attached below. Here's a trimmed-down version
of the stack trace (see attachment for full stack).
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xf7712ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0xef23d426 in XrdSysCondVar::Wait(int) () from .../libNetx.so
#3 0xef19af48 in XrdClient::Read (this=0x16a44840, buf=0x1d70d600,offset=247982068, len=489) at XrdClient.cc:618
#4 0xef2236d4 in TXNetFile::ReadBuffer(char*, int) () from .../libNetx.so
#5 0xf586f623 in TBasket::ReadBasketBuffers(long long, int, TFile*)()from ../libTree.so
#6 0xf58783fc in TBranch::GetBasket(int) () from .../libTree.so
First off, note that the "XrdClient::Read" is only requisting 489 bytes in this read -
no read-ahead or caching seems to be in effect. As is well-known, many short reads
like this will not perform well across the WAN. So I am interested in testing the various
read-ahead parameters, such as
XNet.ReadAheadSize
XNet.ReadAheadStrategy
XNet.ReadTrimBlockSize
XNet.ReadCacheSize
Note that the system.rootrc for this job has:
XNet.ReadAheadSize: 0
XNet.ReadCacheSize: 10000000
XNet.ParStreamsPerPhyConn: 0
(from 16.0.3/sw/lcg/app/releases/ROOT/5.26.00e_python2.6/i686-slc5-gcc43-opt/root) -
Question 1)
Why is ReadAheadSize set to 0 here? And what's the best way to override this?
I found this document http://root.cern.ch/drupal/content/spin-little-disk-spin
which says (in reference to read-ahead/caching)
We'd love to turn it on by default. But the read-ahead needs to put the data somewhere: we need some extra memory; 30 to 100MB are enough. But with the experiments' never ending quest for memory resources turning this on by default is not an option: a 4-core machine with 4 jobs would eat an additional 400MB. Too much.
Instead you need to turn it on by yourself by calling
tree->SetCacheSize(10000000);
tree->AddBranchToCache("*");
I don't think we can force users to do this, is there somewhere else in the stack
that this code could be inserted?
Question 2)
Despite the small read-size and 0 read-ahead there may be something else happening as
well - note from the attached strace output (which covers a time-span of 90 seconds)
that there8373 messages sent to fd 79 (the xroot server) and no replies recieved. All of the
sent messages are 24 bytes in length:
send(79, "\3\0\v\305\0\0\0\0\0\0\0\0\16\306R\0\0\0\4\0\0\0\0\0", 24, 0) = 24
these are going out at a rate of about 90 per second. It seems that the messages
are being sent but no replies recieved - in the logs on the xroot server I see:
110515 11:47:32 11957 XrootdXeq: usatlas1.12016:[log in to unmask] login
110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901
110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901 rc=-3 (No such process)
110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdResponse: sending err 3011: session not found
Is this "session not found" the cause of the failures?
Note that when this is happening, the server load is low (<1), the network is not
congested, etc so there is no reason to be expecting timeouts.
Question 3)
I also see from strace output that the code is calling 'getrusage' excessively, does this
really need to be checked 2500 times per second?
Here's a count of system calls made in 90 seconds.
232452 getrusage
77484 gettimeofday
45394 time
16750 futex
15908 timer_settime
9131 clock_gettime
8373 send
8373 poll
1930 write
1137 lstat64
758 read
758 open
758 close
62 rt_sigaction
31 _llseek
this seems quite inefficient!
Note that the 758 instances of "read/open/close" are not recieving any data, they are just checking
the processes memory usage:
open("/proc/self/statm", O_RDONLY|O_LARGEFILE) = 69
read(69, "223420 146018 5380 299 0 154771 "..., 46) = 34
close(69) = 0
This is happening at a rate of ~10 Hz, which is not quite as bad as the ~2500Hz of "getrusage" calls.
Here are the gdb and strace outputs:
http://www.mwt2.org/~cgw/xrd-logs/
|