Hi,

We have two distinct compute clusters (PDSF and Carver) that we use to access an xrootd system hosted on one of them (PDSF). The two clusters run the same OS, and we link our software against the same client libraries. We have no problems accessing the service from PDSF, where the service is located. Access also works from the second cluster (Carver); however, the process there allocates a huge amount of memory (seemingly related to file size and number of files read) that is never released, so if we try to read multiple files, the jobs die with:

Xrd: PhyConnection: Can't run reader thread: out of system resources. Critical error.

(or sometimes crash with a 'bad alloc' error). I can reproduce the symptom using xrdcp (our xrd version is v3.3.4). When I run with --debug #, I don't see any difference between the two systems. When I then run under valgrind, the output is identical except that on Carver valgrind issues several warnings of the type:

==22839== Warning: set address range perms: large range [0x39431000, 0x49432000) (defined)

with an address range nearly always 0x10001000 wide. The warning seems to be just a notice from valgrind that a large address range was allocated. No such warnings appear when run from PDSF.
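
As a quick sanity check on that width, the following minimal Python sketch computes the size of the range from the warning quoted above (the variable names are only illustrative). It works out to 256 MiB plus one 4 KiB page; the sketch says nothing about what the allocation is used for.

```python
# Bounds copied from the valgrind warning quoted above.
start = 0x39431000
end = 0x49432000  # exclusive upper bound, per the "[start, end)" notation

# Size of the half-open range in bytes.
width = end - start

# Split into whole MiB and leftover bytes.
mib, remainder = divmod(width, 1024 * 1024)

print(hex(width))      # 0x10001000
print(mib, remainder)  # 256 4096
```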

At this point, the only difference I see between running from the two resources is the network topology, which I believe is largely IB-connected. The admins are available to help debug, given some guidance on what to target.

thanks,

Jeff


View this issue on GitHub: https://github.com/xrootd/xrootd/issues/142