Lukasz: Thanks for the reply - there's still some work to do to understand
this fully. Initial responses are inline.
> The issue is a bit complicated here. The job that you are
> analyzing is a ROOT job that uses XRootD plugin to access data over
> the network. In your stack trace the entry point to XRootD is:
> XrdClient::Read, everything above (#x < 3) is XRootD and everything
> below (#x > 3) is ROOT.
Thanks for the clarification.
> 1) When the traffic optimization is supposed to happen in XRootD, you
> would see things like a XRootD::Read of len=489, because any external
> application (ROOT in this case) just needs to state what data it needs
> and the task of figuring out how to fetch it in the most optimal way
> is entrusted to XRootD which will optimize the actual network reads.
> The tweaking can then be done by the parameters you mention. Since
> XRootD is a generic data access software it has no knowledge of the
> underlying data file format so all it can do is some statistical guess
> work that may be more or less optimal.
Yes, but for sites which are running a particular type of job - for
example, a USATLAS site - the jobs have similar data-access patterns
and it makes sense to have one set of read-ahead params for all the
jobs, controlled by an environment variable or setup file. For
example at the MWT2 site, we were using dcap access, and were having
good results with WAN reads once we set a few environment variables:
    export DC_LOCAL_CACHE_MEMORY_PER_FILE=10000000
    export DC_LOCAL_CACHE_BLOCK_SIZE=32768
    export DCACHE_RA_BUFFER=16000
which were applied to all ATLAS jobs. We have not been able to get
the same level of performance with xrootd across the WAN as we had
gotten with dcache with the above env. vars. There should be controls
which can be adjusted by (knowledgeable) site admins, who can tune
this to get optimal performance for their given operating parameters
(network latency, disk I/O performance, memory limitation, etc).
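To make concrete what I mean by a site-level control, here is a minimal
sketch of how the dcap settings above could be applied from a job wrapper
or profile script. Where the snippet lives and how it gets sourced is an
assumption for illustration. The variable names and values are the ones
we used at MWT2.

```shell
# Hypothetical site-wide snippet, e.g. sourced by the job wrapper.
# ":=" assigns only when the variable is unset or empty, so a job that
# already exports its own value keeps it.
: "${DC_LOCAL_CACHE_MEMORY_PER_FILE:=10000000}"
: "${DC_LOCAL_CACHE_BLOCK_SIZE:=32768}"
: "${DCACHE_RA_BUFFER:=16000}"
export DC_LOCAL_CACHE_MEMORY_PER_FILE DC_LOCAL_CACHE_BLOCK_SIZE DCACHE_RA_BUFFER
```

Something equivalent for the xrootd client is what I am looking for.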
> 2) When the traffic optimization is supposed to happen in ROOT, you
> would most probably see xrootd being asked to perform vector reads
> (XrdClient::ReadV) of around 30 megs. This is far more optimal because
> ROOT knows its file format and can easily predict which parts of the
> file it will need in the nearest future so it is able to prefetch data
> before it is needed by the application. The parameters here are the
> ones that you mention below: tree->SetCacheSize and friends.
OK, but as I understand it, there is no way to turn this on at the
site level - this can't be enabled in system.rootrc or via an env.
var, can it? It seems this requires users to rewrite their code.
> > Question 1)
> > Why is ReadAheadSize set to 0 here? And what's the best way to override this?
>
> Because most probably it is assumed that the optimization will be
> done in ROOT and XRootD should not bother to do anything.
>
> > We'd love to turn it on by default. But the read-ahead needs to put
> > the data somewhere: we need some extra memory; 30 to 100MB are
> > enough. But with the experiments' never ending quest for memory
> > resources turning this on by default is not an option: a 4-core
> > machine with 4 jobs would eat an additional 400MB. Too much.
>
> Well, if you turn on the read-ahead or any prefetching you need to
> store the additional data somewhere and RAM is the easiest target. We
> work on some code that will hopefully be committed to ROOT soon which
> will enable it to prefetch the data blocks and store them on disk.
This could affect performance negatively - for a multi-core host, you
are now creating more I/O to the local disk, which can become a
bottleneck. This is exactly the reason we prefer remote-access to
stage-in for job inputs - having 24 or more jobs all access the same
local disk can create I/O contention. For worker nodes which are not
short on RAM, the additional memory usage of a (modest) cache is not
a problem. (Using 10MB/file for dCache gave a huge performance boost
and did not cause undue memory pressure. We did not experiment with
lower values, but I suspect even 1MB/file would be enough to help
significantly.)
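As a rough back-of-the-envelope check on the memory cost, the per-node
footprint is just jobs x open files per job x cache per file. The sketch
below uses the 24-job node and the 10MB and 1MB per-file figures from
above. The number of files open per job is an assumed value for
illustration only and will vary by job type.

```shell
# Rough per-node memory estimate for a per-file read cache.
cache_footprint_mb() {
    # $1 = jobs per node, $2 = open files per job, $3 = cache per file (MB)
    echo $(( $1 * $2 * $3 ))
}

jobs_per_node=24
files_per_job=2   # assumption, not a measured value

cache_footprint_mb "$jobs_per_node" "$files_per_job" 10   # 10MB/file
cache_footprint_mb "$jobs_per_node" "$files_per_job" 1    # 1MB/file
```

The same function reproduces the figure quoted earlier: 4 jobs with one
100MB cache each comes to 400MB.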
> > Instead you need to turn it on by yourself by calling
> >
> > tree->SetCacheSize(10000000);
> > tree->AddBranchToCache("*");
> >
> >
> > I don't think we can force users to do this, is there somewhere else in the stack
> > that this code could be inserted?
>
> Not really, since it's dependent on the user data that is being
> read. You could argue that it should use some caching by default but
> it's debatable.
Well, we are comparing performance of xrootd and dcache. For jobs
accessing data across LAN, the performance is comparable. But once
the WAN enters the picture, the xrootd performance is poor compared to
what we get with dCache. So, I'm trying to at least reproduce the
dCache performance results ... (using ATLAS "Hammercloud" job
performance as my metric).
> > Question 2)
> > Is this "session not found" the cause of the failures?
>
> This is a request to end a session that apparently does not
> exist... I will have a closer look. What is your access pattern? Do
> you have long standing jobs that keep the connections open for a long
> time, or is it more like the jobs fetching the data they need to
> process and quitting after the processing is over?
This is a "canned" ATLAS job from the Hammercloud testing system - using
Test template: 73 (stress) - Muon 16.0.3.3 PANDA default data-access
I'll do some research to find out more about what this job is actually
doing.
> > Question 3)
> >
> > I also see from strace output that the code is calling 'getrusage' excessively, does this
> > really need to be checked 2500 times per second?
>
> Are you sure that getrusage is called from XRootD and not from some
> other place in your framework? I could not quite reproduce this issue
> with xrdcp which uses the same underlying API.
Good point, this could be coming from anywhere - I'll try putting
a breakpoint on "getrusage" and see what I find.
Thanks!
- Charles