Our group ran some studies charting XRootD performance when the server sits in the WAN of a Lustre file system but is accessed locally. We are a T3 cluster with 26 worker nodes of 8 cores each. The local XRootD server runs on a node with a 10 Gbps NIC connected to the campus uplink (also 10 Gbps). The Lustre storage sits 14 ms RTT away at a T2 and houses 72 optical drives.
We tested the XRootD server with a CPU-bound Higgs analysis and a more I/O-bound ROOT performance routine. The same jobs were also run with direct connections from the T3 worker nodes to the Lustre file system (as a control), and again through an XRootD server at the T2 (to compare our XRootD server, which is remote from the Lustre storage, against an XRootD server local to the storage).
The results with the XRootD server in the wide area network of the storage were dismal. While the controls averaged ~3 Gbps in tests with up to 192 concurrent jobs, the tests with the XRootD server in the WAN of the Lustre storage quickly saturated at 1 Gbps (even with only 32 jobs!).
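As context for the numbers above, here is a rough bandwidth-delay-product estimate (a sketch; the only input taken from our setup is the 14 ms RTT) of how much in-flight data a single TCP stream needs at this latency:

```python
# Back-of-the-envelope: TCP window (bandwidth-delay product) needed
# to sustain a given throughput over our 14 ms RTT path.

RTT_S = 0.014  # 14 ms round-trip time to the T2


def window_bytes(throughput_gbps: float, rtt_s: float = RTT_S) -> float:
    """Bytes of in-flight data needed to sustain the given rate."""
    return throughput_gbps * 1e9 / 8 * rtt_s


# Window a single stream needs to reach the 1 Gbps we observe:
print(f"1 Gbps  -> {window_bytes(1) / 2**20:.2f} MiB")   # ~1.67 MiB
# Window needed to fill the 10 Gbps NIC with one stream:
print(f"10 Gbps -> {window_bytes(10) / 2**20:.2f} MiB")  # ~16.69 MiB
```

If the per-connection TCP window (or any buffering on the XRootD data path) were capped near the ~1.7 MiB figure, that alone could explain a per-stream ceiling around 1 Gbps at this distance, though it would not by itself explain the aggregate cap across many jobs.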
My questions are:
- Does anyone know (and ideally can point to documentation) whether there is a design limitation that caps the performance of XRootD servers in the WAN of the storage?
- The connection between the WAN Lustre and our cluster is stable but jittery (rates fluctuate widely). Could this be impacting XRootD performance?
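To make the jitter question concrete, a sketch of how we could quantify it from repeated TCP connection setup times (the host and port in the usage comment are placeholders; 1094 is the standard XRootD port):

```python
# Sketch: estimate RTT mean and jitter from repeated TCP connects.
import socket
import statistics
import time


def connect_rtt(host: str, port: int, samples: int = 20) -> tuple[float, float]:
    """Return (mean_ms, stdev_ms) of TCP connection setup times."""
    rtts_ms = []
    for _ in range(samples):
        t0 = time.perf_counter()
        # Connection setup takes roughly one round trip.
        with socket.create_connection((host, port), timeout=5):
            pass
        rtts_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(rtts_ms), statistics.stdev(rtts_ms)


# Example (placeholder host; 1094 is the default XRootD port):
# mean_ms, jitter_ms = connect_rtt("t2-storage.example.org", 1094)
```

The stdev here plays the role of ping's mdev; a large value relative to the 14 ms mean would support the "jittery" characterization with a number.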
Thank you very much for your responses. Please let me know if you would like more details about our experiments.