I am forwarding this email from a different address, as Matevz
pointed out that the one below never reached the xrootd list.
-------- Forwarded Message --------
Dear all,
we are using XCache to test the integration of CINECA HPC with
CMS workflows. I'd say that everything looks quite good, except for
a strange problem that appears rarely but seems clearly related to
the number of connected clients.
What we see is a client failing to read a file with the error "File
exists". After that, the same file keeps failing for all other
requests, so it is in a certain sense "corrupted".
We see this happening with the cached file in different states:
sometimes it has size==0, sometimes it is partially there, so there
is no particular pattern. A typical error on the xcache machine
(with a high enough (I hope) debug level for xrd and ofs) is like
this one (*).
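To get an overview of how many cached files are in the size==0
state, something like the following could be used. This is only a
minimal sketch: the function name and the cache root path are
hypothetical, and it assumes that the cache keeps its per-file
metadata in files ending in ".cinfo", which are skipped here.

```python
import os


def find_empty_files(cache_root, skip_suffix=".cinfo"):
    """Return the sorted paths of zero-length regular files under cache_root.

    Files ending in skip_suffix are ignored (assumed to be cache
    metadata rather than cached data).
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            if name.endswith(skip_suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file may have been removed while we were scanning.
                continue
    return sorted(empty)


# Hypothetical usage; replace the path with the real cache directory:
# for path in find_empty_files("/path/to/cache"):
#     print(path)
```

Partially cached files are not detected by this check, since their
size on disk alone does not say whether they are complete.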
Some additional details:
- the fs where the data are stored is a high-performance GPFS (SSD
underneath)
- network bandwidth is 40 Gbps (less than 2 used)
- the machine and disks do not look busy at all. The load is less
than 1 on a 16-core machine, and the iowait is practically zero
- connected clients are around 500-1000
- the number of open files is stable at ~"lsof -p 3100 " with
the limit for xrootd at 65k
- the configuration used is here (**) and the xrootd version is
4.12.3
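
The open-file check in the list above could also be done with a small
script, for example to log it over time. This is a rough sketch for
Linux only: it reads /proc instead of calling lsof, the function names
are made up for illustration, and the pid has to be supplied by hand.
Note that lsof also lists memory-mapped files and similar entries, so
its count can be larger than the number of descriptors counted here.

```python
import os


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: "Max", "open", "files", soft limit, hard limit, units
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count the file descriptors currently open in the given process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid):
    """Return (open descriptors, soft limit) for the given process."""
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_max_open_files(fh.read())
    return count_open_fds(pid), limit


# Hypothetical usage; the pid must be that of the xrootd process and
# reading /proc/<pid>/fd needs the same user or root:
# used, limit = fd_usage(12345)
# print(f"{used} open descriptors, soft limit {limit}")
```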
Have you had any previous experience with anything similar? Could
you help us understand what is happening?
Please let me know if you need any additional information.
Cheers,
Diego
(*)
https://gist.githubusercontent.com/dciangot/4b4cd9e625203c307b448a7176305ae8/raw/3317a3224db344fa731245721271aa0f4e7afce2/log.err
(**)
https://gist.githubusercontent.com/dciangot/d9ded6883c30ddd1fc871c378f9b5877/raw/991c7c25b5d1afe85dec4e1876f8ba278eea5225/cache.conf