(NB: Attachments to this message are posted at http://www.mwt2.org/~cgw/xrd-logs/ . I apologize for any problems caused by attempting to send email with excessively large attachments.)

Hi -

At the ANALY_MWT2 site we have some worker nodes which read from xrootd storage across a (10Gb) WAN link. Running ATLAS Hammercloud stress tests (test template 73 - muons) we see a CPU/walltime efficiency of 75% for the local nodes vs. 15% for the nodes reading across the WAN, and I'm trying to understand the reasons for this.

I attached to the running process and obtained a stack trace as well as 'strace' output, both of which are attached. Here's a trimmed-down version of the stack trace (see the attachment for the full stack):

  #0  0xffffe430 in __kernel_vsyscall ()
  #1  0xf7712ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
  #2  0xef23d426 in XrdSysCondVar::Wait(int) () from .../libNetx.so
  #3  0xef19af48 in XrdClient::Read (this=0x16a44840, buf=0x1d70d600, offset=247982068, len=489) at XrdClient.cc:618
  #4  0xef2236d4 in TXNetFile::ReadBuffer(char*, int) () from .../libNetx.so
  #5  0xf586f623 in TBasket::ReadBasketBuffers(long long, int, TFile*) () from ../libTree.so
  #6  0xf58783fc in TBranch::GetBasket(int) () from .../libTree.so

First off, note that XrdClient::Read is requesting only 489 bytes in this read; no read-ahead or caching seems to be in effect. As is well known, many short reads like this will not perform well across the WAN. So I am interested in testing the various read-ahead parameters, such as:

  XNet.ReadAheadSize
  XNet.ReadAheadStrategy
  XNet.ReadTrimBlockSize
  XNet.ReadCacheSize

Note that the system.rootrc for this job (from 16.0.3/sw/lcg/app/releases/ROOT/5.26.00e_python2.6/i686-slc5-gcc43-opt/root) has:

  XNet.ReadAheadSize:         0
  XNet.ReadCacheSize:         10000000
  XNet.ParStreamsPerPhyConn:  0

Question 1) Why is ReadAheadSize set to 0 here? And what's the best way to override this?
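For what it's worth, one way I was planning to test overrides (a sketch, assuming the standard ROOT resource-file lookup order, in which a .rootrc in the job's working directory takes precedence over the release's system.rootrc) is to drop in a local .rootrc; the values below are purely illustrative, not tuned recommendations:

```
# .rootrc placed in the job's working directory (illustrative values only)
XNet.ReadAheadSize:   512000
XNet.ReadCacheSize:   10000000
```

If the lookup-order assumption holds, this would let us experiment per-job without touching the shared release area.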
I found this document http://root.cern.ch/drupal/content/spin-little-disk-spin which says (in reference to read-ahead/caching):

  We'd love to turn it on by default. But the read-ahead needs to put the data
  somewhere: we need some extra memory; 30 to 100MB are enough. But with the
  experiments' never-ending quest for memory resources, turning this on by
  default is not an option: a 4-core machine with 4 jobs would eat an
  additional 400MB. Too much. Instead you need to turn it on by yourself by
  calling

    tree->SetCacheSize(10000000);
    tree->AddBranchToCache("*");

I don't think we can force users to do this; is there somewhere else in the stack that this code could be inserted?

Question 2) Despite the small read size and zero read-ahead, there may be something else happening as well. Note from the attached strace output (which covers a time span of 90 seconds) that there are 8373 messages sent to fd 79 (the xroot server) and no replies received. All of the sent messages are 24 bytes in length:

  send(79, "\3\0\v\305\0\0\0\0\0\0\0\0\16\306R\0\0\0\4\0\0\0\0\0", 24, 0) = 24

These are going out at a rate of about 90 per second. It seems that the messages are being sent but no replies received. In the logs on the xroot server I see:

  110515 11:47:32 11957 XrootdXeq: usatlas1.12016:[log in to unmask] login
  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901
  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdProtocol: endsess 11955:61.163901 rc=-3 (No such process)
  110515 11:47:32 11957 usatlas1.12016:[log in to unmask] XrootdResponse: sending err 3011: session not found

Is this "session not found" the cause of the failures? Note that when this is happening, the server load is low (<1), the network is not congested, etc., so there is no reason to be expecting timeouts.

Question 3) I also see from the strace output that the code is calling 'getrusage' excessively; does this really need to be checked 2500 times per second?
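To illustrate what I mean by "somewhere else in the stack": a minimal sketch of what a centrally-placed hook might look like, using only the two TTree calls quoted from the document above (the helper name and the idea of calling it from the framework are my own assumptions, not an existing API):

```
// Sketch only: enable the TTree read cache for a tree the framework has
// just retrieved from an input file. Assumes ROOT headers/libraries.
#include "TTree.h"

void EnableTreeCache(TTree* tree, Long64_t cacheBytes = 10000000)
{
   if (!tree) return;
   tree->SetCacheSize(cacheBytes);   // reserve ~10 MB for the read cache
   tree->AddBranchToCache("*");      // add all branches to the cache
}
```

If something like this could be called wherever the analysis framework hands trees to user code, users would get the cache without having to opt in themselves.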
Here's a count of system calls made in 90 seconds:

  232452 getrusage
   77484 gettimeofday
   45394 time
   16750 futex
   15908 timer_settime
    9131 clock_gettime
    8373 send
    8373 poll
    1930 write
    1137 lstat64
     758 read
     758 open
     758 close
      62 rt_sigaction
      31 _llseek

This seems quite inefficient! Note that the 758 instances of read/open/close are not receiving any data; they are just checking the process's memory usage:

  open("/proc/self/statm", O_RDONLY|O_LARGEFILE) = 69
  read(69, "223420 146018 5380 299 0 154771 "..., 46) = 34
  close(69) = 0

This is happening at a rate of ~10 Hz, which is not quite as bad as the ~2500 Hz of getrusage calls.

Here are the gdb and strace outputs: http://www.mwt2.org/~cgw/xrd-logs/
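In case anyone wants to reproduce the tally above from the posted logs: here is a sketch of the kind of pipeline that produces it from a raw strace capture (the strace.out file and its contents below are a made-up stand-in for the real attachment; strace -c would also print a summary table directly):

```shell
# Build a tiny stand-in for a raw strace log (the real input would be
# the strace output posted at the URL above).
printf '%s\n' \
  'getrusage(RUSAGE_SELF, {...}) = 0' \
  'getrusage(RUSAGE_SELF, {...}) = 0' \
  'gettimeofday({1305478052, 0}, NULL) = 0' \
  > strace.out

# Strip everything from the first '(' onward, leaving just the syscall
# name, then count occurrences and sort by frequency.
sed 's/(.*//' strace.out | sort | uniq -c | sort -rn
```

Run against the real log, this gives the per-syscall counts shown above.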