
(NB: Attachments to this message are posted at 
     http://www.mwt2.org/~cgw/xrd-logs/ .  I apologize
     for any problems caused by attempting to send email
     with excessively large attachments).

Hi - 

 At the ANALY_MWT2 site we have some worker nodes which read from xrootd
storage across a 10Gb WAN link.  Running ATLAS HammerCloud
stress tests (test template 73 - muons) we see a CPU/walltime
efficiency of 75% for the local nodes vs. 15% for the nodes reading
across the WAN, and I'm trying to understand the reasons for this.

 I attached gdb to the running process and obtained a stack trace as well
as 'strace' output (both posted at the URL above).  Here's a trimmed-down
version of the stack trace (see the attachment for the full stack).

#0  0xffffe430 in __kernel_vsyscall ()
#1  0xf7712ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xef23d426 in XrdSysCondVar::Wait(int) ()   from .../libNetx.so
#3  0xef19af48 in XrdClient::Read (this=0x16a44840, buf=0x1d70d600, offset=247982068, len=489) at XrdClient.cc:618
#4  0xef2236d4 in TXNetFile::ReadBuffer(char*, int) ()  from .../libNetx.so
#5  0xf586f623 in TBasket::ReadBasketBuffers(long long, int, TFile*) () from .../libTree.so
#6  0xf58783fc in TBranch::GetBasket(int) () from .../libTree.so


First off, note that XrdClient::Read is only requesting 489 bytes in this read -
no read-ahead or caching seems to be in effect.  As is well known, many short reads
like this will not perform well across the WAN.  So I am interested in testing the various
read-ahead parameters, such as

 XNet.ReadAheadSize
 XNet.ReadAheadStrategy
 XNet.ReadTrimBlockSize
 XNet.ReadCacheSize

Note that the system.rootrc for this job has:
 XNet.ReadAheadSize: 0
 XNet.ReadCacheSize: 10000000
 XNet.ParStreamsPerPhyConn: 0

(from 16.0.3/sw/lcg/app/releases/ROOT/5.26.00e_python2.6/i686-slc5-gcc43-opt/root) - 
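
For reference, here is roughly what I have in mind for overriding these
defaults.  This is only a sketch - the values are illustrative rather than
tuned recommendations, and using gEnv->SetValue like this assumes the jobs
load a site-wide rootlogon.C (or equivalent), which I have not verified:

   // rootlogon.C-style sketch; illustrative values only.
   {
      gEnv->SetValue("XNet.ReadAheadSize", 512*1024);     // turn read-ahead back on (512 kB)
      gEnv->SetValue("XNet.ReadCacheSize", 50*1024*1024); // keep a sizeable client-side cache (50 MB)
   }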

Question 1)
Why is ReadAheadSize set to 0 here?  And what's the best way to override this?
I found this document http://root.cern.ch/drupal/content/spin-little-disk-spin
which says (in reference to read-ahead/caching)


     We'd love to turn it on by default. But the read-ahead needs to put the data somewhere: we need some extra memory; 30 to 100MB are enough. But with the experiments' never ending quest for memory resources turning this on by default is not an option: a 4-core machine with 4 jobs would eat an additional 400MB. Too much.

     Instead you need to turn it on by yourself by calling

       tree->SetCacheSize(10000000);
       tree->AddBranchToCache("*");


I don't think we can force users to do this; is there somewhere else in the stack
where this code could be inserted?  (A rough sketch of what I mean follows.)
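
For example - purely a sketch, where the function name and the assumption that
a framework layer (rather than the user) owns the TTree are mine - something
like this could live in whatever code hands the tree to the user:

   #include "TTree.h"

   // Hypothetical helper: called by the framework on every TTree it opens,
   // so individual users never have to enable the cache themselves.
   void EnableTreeCache(TTree* tree, Long64_t cacheSize = 10000000)
   {
      if (!tree) return;
      tree->SetCacheSize(cacheSize);       // reserve ~10 MB for the TTreeCache
      tree->AddBranchToCache("*", kTRUE);  // cache all branches and sub-branches
   }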


Question 2)

Despite the small read size and zero read-ahead there may be something else happening as
well - note from the attached strace output (which covers a time span of 90 seconds)
that there were 8373 messages sent to fd 79 (the xroot server) and no replies received.  All of the
sent messages are 24 bytes in length:

 send(79, "\3\0\v\305\0\0\0\0\0\0\0\0\16\306R\0\0\0\4\0\0\0\0\0", 24, 0) = 24

These are going out at a rate of about 90 per second.  It seems that the messages
are being sent but no replies are received - in the logs on the xroot server I see:

 110515 11:47:32 11957 XrootdXeq: usatlas1.12016:[log in to unmask] login
 110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901
 110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901 rc=-3 (No such process)
 110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdResponse: sending err 3011: session not found


Is this "session not found" the cause of the failures?

Note that when this is happening, the server load is low (<1), the network is not
congested, etc., so there is no reason to expect timeouts.


Question 3)

I also see from the strace output that the code is calling 'getrusage' excessively;
does this really need to be checked 2500 times per second?

Here's a count of system calls made in 90 seconds.
 232452 getrusage
  77484 gettimeofday
  45394 time
  16750 futex
  15908 timer_settime
   9131 clock_gettime
   8373 send
   8373 poll
   1930 write
   1137 lstat64
    758 read
    758 open
    758 close
     62 rt_sigaction
     31 _llseek

This seems quite inefficient!

Note that the 758 instances of read/open/close are not receiving any data; they are just checking
the process's memory usage:

 open("/proc/self/statm", O_RDONLY|O_LARGEFILE) = 69
 read(69, "223420 146018 5380 299 0 154771 "..., 46) = 34
 close(69)                               = 0

This is happening at a rate of ~10 Hz, which is not quite as bad as the ~2500Hz of "getrusage" calls.



Here are the gdb and strace outputs:

 http://www.mwt2.org/~cgw/xrd-logs/