2011/5/16 Charles G Waldman <[log in to unmask]>:
>  >  When the traffic optimization is supposed to happen in XRootD, you
>  > would see things like an XRootD::Read of len=489, because any external
>  > application (ROOT in this case) just needs to state what data it needs,
>  > and the task of figuring out how to fetch it in the most optimal way
>  > is entrusted to XRootD, which will optimize the actual network reads.
>  > The tweaking can then be done via the parameters you mention. Since
>  > XRootD is generic data-access software, it has no knowledge of the
>  > underlying data file format, so all it can do is some statistical
>  > guesswork that may be more or less optimal.
>
> Yes, but for sites which are running a particular type of job - for
> example, a USATLAS site - the jobs have similar data-access patterns
> and it makes sense to have one set of read-ahead params for all the
> jobs, controlled by an environment variable or setup file.  For
> example at the MWT2 site, we were using dcap access, and were having
> good results with WAN reads once we set a few environment variables:
>
>  export DC_LOCAL_CACHE_MEMORY_PER_FILE=10000000
>  export DC_LOCAL_CACHE_BLOCK_SIZE=32768
>  export DCACHE_RA_BUFFER=16000
>
> which were applied to all ATLAS jobs.  We have not been able to get
> the same level of performance with xrootd across the WAN as we had
> gotten with dcache with the above env. vars.  There should be controls
> which can be adjusted by (knowledgeable) site admins, who can tune
> this to get optimal performance for their given operating parameters
> (network latency, disk I/O performance, memory limitation, etc).

   Yes, this is quite correct. You can make similar adjustments with
XRootD. Unfortunately, in the older versions (the ones coming with
ROOT < 5.28) you can do that only by putting a .rootrc file in the CWD
of the job you are running, or in $ROOTSYS/etc; in the versions coming
with ROOT > 5.28 you can set them via environment variables.

   ReadAheadStrategy can take three values:

0) no read-ahead
1) sequential - read some data ahead of the currently requested buffer
(how much is read ahead is specified by ReadAheadSize)
2) sliding window - a window centered on the recent average offset
slides through the file, following the stream of requests

   ReadCacheSize denotes how much data should be cached per file.
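
   For illustration, a site-wide .rootrc could carry these settings in
one place. This is only a sketch: the exact key names are the XNet
entries listed in $ROOTSYS/etc/system.rootrc and may differ between
ROOT versions, so please check the file shipped with yours.

      # use the sequential read-ahead strategy (value 1 above)
      XNet.ReadAheadStrategy:  1
      # how much to read ahead of the requested buffer, in bytes
      XNet.ReadAheadSize:      524288
      # per-file client cache, roughly 10 MB as in your dCache setup
      XNet.ReadCacheSize:      10000000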

>  > 2) When the traffic optimization is supposed to happen in ROOT, you
>  > would most probably see xrootd being asked to perform vector reads
>  > (XrdClient::ReadV) of around 30 megs. This is far more optimal because
>  > ROOT knows its file format and can easily predict which parts of the
>  > file it will need in the near future, so it is able to prefetch data
>  > before it is needed by the application. The parameters here are the
>  > ones that you mention below: tree->SetCacheSize and friends.
>
> OK, but as I understand it, there is no way to turn this on at the
> site level - this can't be enabled in system.rootrc or via an env.
> var, can it?  It seems this requires users to rewrite their code.

   Correct.

>  > > Question 1)
>  > > Why is ReadAheadSize set to 0 here?  And what's the best way to override this?
>  >
>  >    Because most probably it is assumed that the optimization will be
>  > done in ROOT and XRootD should not bother to do anything.
>  >
>  > >     We'd love to turn it on by default. But the read-ahead needs to put the data somewhere: we need some extra memory; 30 to 100MB are enough. But with the experiments' never ending quest for memory resources turning this on by default is not an option: a 4-core machine with 4 jobs would eat an additional 400MB. Too much.
>  >
>  >    Well, if you turn on the read-ahead or any prefetching you need to
>  > store the additional data somewhere, and RAM is the easiest target. We
>  > are working on some code that will hopefully be committed to ROOT soon,
>  > which will enable it to prefetch the data blocks and store them on disk.
>
> This could affect performance negatively - for a multi-core host, you
> are now creating more I/O to the local disk, which can become a
> bottleneck.  This is exactly the reason we prefer remote-access to
> stage-in for job inputs - having 24 or more jobs all access the same
> local disk can create I/O contention.  For worker nodes which are not
> short on RAM, the additional memory usage of a (modest) cache is not
> a problem.  (Using 10MB/file for dCache gave a huge performance boost
> and did not cause undue memory pressure.  We did not experiment with
> lower values but I suspect even 1MB/file would be enough to help
> significantly).

   It depends on the use case; the new code also introduces parallel
prefetching of the TTreeCache buffers, and the first results are quite
promising. In any case it will be possible to disable or tweak it
depending on your particular needs.

>
>  > >     Instead you need to turn it on by yourself by calling
>  > >
>  > >       tree->SetCacheSize(10000000);
>  > >       tree->AddBranchToCache("*");
>  > >
>  > >
>  > > I don't think we can force users to do this, is there somewhere else in the stack
>  > > that this code could be inserted?
>  >
>  >    Not really, since it's dependent on the user data that is being
>  > read. You could argue that it should use some caching by default but
>  > it's debatable.
>
> Well, we are comparing performance of xrootd and dcache.  For jobs
> accessing data across LAN, the performance is comparable.  But once
> the WAN enters the picture, the xrootd performance is poor compared to
> what we get with dCache.  So, I'm trying to at least reproduce the
> dCache performance results ... (using ATLAS "Hammercloud" job
> performance as my metric).

   Try the same kind of tweaking that you have done for dCache :)
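
   For example, if you cannot touch the users' code or the system-wide
rootrc, a small startup macro (e.g. a rootlogon.C dropped into the
job's working directory) can set the same XNet values before any file
is opened. Again only a sketch, reusing the key names assumed above:

      // rootlogon.C - hypothetical site-wide tuning, picked up by ROOT at startup
      {
         // same knobs as the .rootrc sketch; the values are examples only
         gEnv->SetValue("XNet.ReadAheadStrategy", 1);    // sequential read-ahead
         gEnv->SetValue("XNet.ReadAheadSize", 524288);   // bytes read ahead
         gEnv->SetValue("XNet.ReadCacheSize", 10000000); // ~10 MB cache per file
      }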

>
>  > > Question 2)
>  > > Is this "session not found" the cause of the failures?
>  >
>  >    This is a request to end a session that apparently does not
>  > exist... I will have a closer look. What is your access pattern? Do
>  > you have long standing jobs that keep the connections open for a long
>  > time, or is it more like the jobs fetching the data they need to
>  > process and quitting after the processing is over?
>
> This is a "canned" ATLAS job from the Hammercloud testing system - using
>
>  Test template: 73 (stress) - Muon 16.0.3.3 PANDA default data-access
>
> I'll do some research to find out more about what this job is actually
> doing.
>

   As Andy already commented, this is a bug and it will be fixed.
Thanks for reporting it!

Cheers,
   Lukasz