Hi Matevz,

thanks for the suggestion. Andy actually gave me the very same advice,
but in the end that was not the cause. Instead, the problem appears to
be related to IPv6 (as Andy also pointed out off-list): since I switched
to IPv4, it seems to be solved.

I hope this helps anyone else who hits a similar issue.

Thanks again,
Diego

On 04/07/2018 16:43, Matevz Tadel wrote:
> Hi Diego,
>
> How do you have the ulimits set? The nofile limit also caps the number
> of sockets.
>
> This is what we use at UCSD:
>
> [0741] root@xcache-01 ~# cat /etc/security/limits.d/50-xrootd.conf
> xrootd soft nproc 20000
> xrootd hard nproc 21000
> xrootd soft nofile 99000
> xrootd hard nofile 100000
> * soft core unlimited
>
> Cheers,
> Matevz
>
> On 7/2/18 4:23 PM, Diego Ciangottini wrote:
>> Dear experts,
>>
>> I'm using a proxy file cache server to serve inputs for a computing
>> cluster deployed on a cloud environment for CMS experiment workflows.
>> It consists of a cluster of 3 machines under a common redirector.
>> Each machine has quite high RAM (256GB) and a low-latency, high-IO
>> volume (10TB each), and the bandwidth is 10Gbps for each server.
>>
>> The setup is working quite nicely but, starting from around
>> 800-1000 concurrent jobs, we began to see, with increasing
>> frequency, server-side connection errors like (*), corresponding
>> to client failures like (**). Unfortunately I did not manage to find
>> more debugging information :/ Moreover, the errors do not seem
>> correlated with the origin server chosen, so I suspect it could be
>> something related to the network or the cache host machine
>> configuration.
>> You can find here (***) the relevant part of the cache xrd
>> configuration. Do you have any ideas, guidance, or previous
>> experience regarding this kind of issue?
>>
>> Cheers,
>> Diego
>>
>> (*)
>> [2018-07-02 13:51:27.968006 +0000][Error ][AsyncSock ]
>> [[log in to unmask]:1094 #0.0] Unable to initiate the connection:
>> [ERROR] Socket error: Network is unreachable
>> (**)
>> failure when reading from 192.168.77.20:32294 (unknown site); failed
>> with error '[ERROR] Operation expired' (errno=0, code=206).
>> (***)
>> set rdtrCache=192.168.72.247
>> set rdtrPortCmsd=31112
>> set rdtrGlobal=xrootd-cms.infn.it
>> set rdtrGlobalPort=1094
>> set cacheLowWm=0.8
>> set cacheHiWm=0.9
>> set cacheLogLevel=error
>> set cachePath=/storage
>> set cacheRam=60
>> set cacheStreams=256
>> set prefetch=0
>> set blkSize=512k
>>
>> all.export /
>> all.role server
>> oss.localroot $cachePath
>>
>> xrd.port 32294
>> ofs.osslib libXrdPss.so
>> pss.cachelib libXrdFileCache.so
>>
>> pss.origin $rdtrGlobal:$rdtrGlobalPort
>>
>> pss.config streams 256 workers 16
>> pss.setopt ConnectTimeout 30
>> pss.setopt DebugLevel 3
>> pss.setopt RequestTimeout 30
>>
>> xrootd.seclib /usr/lib64/libXrdSec.so
>>
>> pfc.diskusage $cacheLowWm $cacheHiWm
>> pfc.ram ${cacheRam}g
>>
>> pfc.blocksize $blkSize
>> pfc.prefetch $prefetch
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the XROOTD-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
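
For anyone debugging a similar setup, the two suspects discussed in this thread (the open-file limit and an unusable IPv6 route) can be checked from the cache host with a small script. What follows is only a minimal Python sketch and not part of the original exchange: the helper names nofile_limits and probe are hypothetical, and it uses nothing beyond the Python standard library. It reports the nofile limits of the current process and tries a plain TCP connect over a single address family, flagging the "Network is unreachable" case separately.

```python
import errno
import resource
import socket


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def probe(host, port, family, timeout=5.0):
    """Try a TCP connect to host:port using only the given address family.

    Returns a list of (address, outcome) pairs, one per resolved address.
    The outcome is "ok" or a short description of the failure.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # No address of this family could be resolved for the host.
        return [(None, f"resolution failed: {exc}")]

    results = []
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "ok"
        except OSError as exc:
            if exc.errno == errno.ENETUNREACH:
                # Same condition as "Socket error: Network is unreachable".
                outcome = "network is unreachable"
            else:
                outcome = f"failed: {exc}"
        results.append((sockaddr[0], outcome))
    return results
```

As a usage example, running probe("xrootd-cms.infn.it", 1094, socket.AF_INET6) and then the same call with socket.AF_INET on the cache host shows whether only the IPv6 path fails. That IPv6 addresses without a usable route were the cause here is an assumption based on the fix reported above (switching to IPv4), not something the thread confirms in detail. Note that nofile_limits() reports the limits of the process running the script, so it should be run as the xrootd user to be comparable with the limits.d settings quoted above.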