Hi Matevz,
thanks for the suggestion. Andy actually gave me the very same one,
but in the end that was not the cause.
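
In case it helps others rule out the limits in the same way, here is a
minimal sketch (assuming Linux and Python 3, standard library only) that
prints the nofile and nproc limits the calling process actually runs
with. The function names are only illustrative, and it reports the limits
of the user who runs it, so it should be run as the xrootd user.

```python
import resource


def current_limits():
    """Return {name: (soft, hard)} for the two limits discussed in this thread."""
    wanted = {
        "nofile": resource.RLIMIT_NOFILE,  # open files, which includes sockets
        "nproc": resource.RLIMIT_NPROC,    # processes/threads for this user
    }
    return {name: resource.getrlimit(res) for name, res in wanted.items()}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```
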
Instead, it looks related to a problem with IPv6 (as Andy also pointed
out off-list): since I switched to IPv4, the problem seems to be
solved. I hope this helps anyone else who hits a similar issue.
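
For anyone who wants to check whether they are in the same situation,
here is a minimal sketch (Python 3, standard library only) that resolves
a host and tries one TCP connection per returned address, so an
unreachable IPv6 route shows up next to a working IPv4 one. The host and
port in the usage part are the global redirector from the configuration
quoted below; the function name probe is only illustrative, and this is
not something XRootD itself provides.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try one TCP connect per resolved address.

    Returns a list of (family_name, address, outcome) tuples, where
    outcome is "ok" or the OS error text (e.g. "Network is unreachable").
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        if family == socket.AF_INET6:
            name = "IPv6"
        elif family == socket.AF_INET:
            name = "IPv4"
        else:
            name = str(family)
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "ok"
        except OSError as exc:
            # Covers unreachable networks, refusals, and timeouts.
            outcome = exc.strerror or str(exc)
        results.append((name, sockaddr[0], outcome))
    return results


if __name__ == "__main__":
    for name, address, outcome in probe("xrootd-cms.infn.it", 1094):
        print(f"{name} {address}: {outcome}")
```

If the IPv6 lines fail while the IPv4 ones succeed, the cache host most
likely has an IPv6 address without a working IPv6 route to the origin.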
Thanks again, Diego
On 04/07/2018 16:43, Matevz Tadel wrote:
> Hi Diego,
>
> How do you have the ulimits set? The nofile also limits the number of
> sockets.
>
> That's what we use at UCSD:
>
> [0741] root@xcache-01 ~# cat /etc/security/limits.d/50-xrootd.conf
> xrootd soft nproc 20000
> xrootd hard nproc 21000
> xrootd soft nofile 99000
> xrootd hard nofile 100000
> * soft core unlimited
>
> Cheers,
> Matevz
>
> On 7/2/18 4:23 PM, Diego Ciangottini wrote:
>> Dear experts,
>>
>> I'm using a proxy file cache server to serve inputs for a computing
>> cluster deployed in a cloud environment for CMS experiment workflows.
>> It currently consists of a cluster of 3 machines under a common
>> redirector. Each machine has a large amount of RAM (256GB) and a
>> low-latency, high-IO volume (10TB each), and the bandwidth is 10Gbps
>> per server.
>>
>> The setup works quite nicely but, starting from around 800-1000
>> concurrent jobs, we began to see, with increasing frequency,
>> server-side connection errors like (*), corresponding to client
>> failures like (**). Unfortunately I did not manage to find more
>> debugging information :/ Moreover, the errors do not look correlated
>> with the origin server chosen, so I suspect the cause is related to
>> the network or to the configuration of the cache host machine.
>> You can find the relevant part of the cache xrd configuration at
>> (***). Do you have any ideas, guidance, or previous experience
>> with this kind of issue?
>>
>> Cheers,
>> Diego
>>
>> (*)
>> [2018-07-02 13:51:27.968006 +0000][Error ][AsyncSock ]
>> [[log in to unmask]:1094 #0.0] Unable to initiate the connection:
>> [ERROR] Socket error: Network is unreachable
>> (**)
>> failure when reading from 192.168.77.20:32294 (unknown site); failed
>> with error '[ERROR] Operation expired' (errno=0, code=206).
>> (***)
>> set rdtrCache=192.168.72.247
>> set rdtrPortCmsd=31112
>> set rdtrGlobal=xrootd-cms.infn.it
>> set rdtrGlobalPort=1094
>> set cacheLowWm=0.8
>> set cacheHiWm=0.9
>> set cacheLogLevel=error
>> set cachePath=/storage
>> set cacheRam=60
>> set cacheStreams=256
>> set prefetch=0
>> set blkSize=512k
>>
>> all.export /
>> all.role server
>> oss.localroot $cachePath
>>
>> xrd.port 32294
>> ofs.osslib libXrdPss.so
>> pss.cachelib libXrdFileCache.so
>>
>> pss.origin $rdtrGlobal:$rdtrGlobalPort
>>
>> pss.config streams 256 workers 16
>> pss.setopt ConnectTimeout 30
>> pss.setopt DebugLevel 3
>> pss.setopt RequestTimeout 30
>>
>> xrootd.seclib /usr/lib64/libXrdSec.so
>>
>> pfc.diskusage $cacheLowWm $cacheHiWm
>> pfc.ram ${cacheRam}g
>>
>> pfc.blocksize $blkSize
>> pfc.prefetch $prefetch
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the XROOTD-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>