Hi Matevz,

thanks for the suggestion. I actually received the very same one from 
Andy, but in the end that wasn't the cause.

Instead, it looks related to a problem with IPv6 (as Andy also pointed 
out off-list): since I switched to IPv4 the problem seems solved. I hope 
this helps anyone else who hits a similar issue.
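
For anyone who wants to check whether they are hitting the same thing, here is a minimal Python sketch that tries a TCP connection to a host over IPv4 and over IPv6 separately. The function name `probe_families` and its return layout are made up for illustration and are not part of XRootD; it only uses the standard `socket` module. If the IPv6 attempts fail with something like "Network is unreachable" while the IPv4 ones succeed, the host's IPv6 setup is a likely suspect.

```python
import socket


def probe_families(host, port, timeout=5.0):
    """Try a TCP connect to host:port over IPv4 and IPv6 separately.

    Returns a dict mapping "IPv4"/"IPv6" to a list of (address, error)
    pairs, where error is None on success or the error text on failure.
    (Hypothetical helper, for illustration only.)
    """
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            # Resolve the host for this address family only.
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError as exc:
            # No address of this family (or the family is unsupported here).
            results[label] = [(None, f"no address: {exc}")]
            continue
        attempts = []
        for fam, socktype, proto, _canonname, sockaddr in infos:
            try:
                with socket.socket(fam, socktype, proto) as sock:
                    sock.settimeout(timeout)
                    sock.connect(sockaddr)
                attempts.append((sockaddr[0], None))
            except OSError as exc:
                # E.g. "Network is unreachable" when there is no usable route.
                attempts.append((sockaddr[0], str(exc)))
        results[label] = attempts
    return results
```

Running it from the cache host against the origin, e.g. `probe_families("xrootd-cms.infn.it", 1094)`, and printing the result shows which address family fails and with which error.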

Thanks again, Diego


On 04/07/2018 16:43, Matevz Tadel wrote:
> Hi Diego,
>
> How do you have the ulimits set? The nofile also limits the number of 
> sockets.
>
> That's what we use at UCSD:
>
> [0741] root@xcache-01 ~# cat /etc/security/limits.d/50-xrootd.conf
> xrootd     soft    nproc     20000
> xrootd     hard    nproc     21000
> xrootd     soft    nofile    99000
> xrootd     hard    nofile    100000
> *          soft    core      unlimited
>
> Cheers,
> Matevz
>
> On 7/2/18 4:23 PM, Diego Ciangottini wrote:
>> Dear experts,
>>
>> I'm using a proxy file cache server to serve inputs for a computing 
>> cluster deployed in a cloud environment for CMS experiment workflows.
>> It consists of a cluster of 3 machines under a common redirector. 
>> Each machine has quite a lot of RAM (256GB), a low-latency, high-IO 
>> volume (10TB), and 10Gbps of bandwidth.
>>
>> The setup works quite nicely but, starting from around 800-1000 
>> concurrent jobs, we began to see, with increasing frequency, 
>> server-side connection errors like (*), corresponding to client 
>> failures like (**). Unfortunately I did not manage to find more 
>> debugging information :/ Moreover, the errors do not look correlated 
>> with the origin server chosen, so I suspect they could be related to 
>> the network or to the cache host machine configuration.
>> The relevant part of the cache xrd configuration is at (***). Do you 
>> have any ideas, guidance, or previous experience with this kind of 
>> issue?
>>
>> Cheers,
>> Diego
>>
>> (*)
>> [2018-07-02 13:51:27.968006 +0000][Error  ][AsyncSock         ] 
>> [[log in to unmask]:1094 #0.0] Unable to initiate the connection: 
>> [ERROR] Socket error: Network is unreachable
>> (**)
>> failure when reading from 192.168.77.20:32294 (unknown site); failed 
>> with error '[ERROR] Operation expired' (errno=0, code=206).
>> (***)
>> set rdtrCache=192.168.72.247
>> set rdtrPortCmsd=31112
>> set rdtrGlobal=xrootd-cms.infn.it
>> set rdtrGlobalPort=1094
>> set cacheLowWm=0.8
>> set cacheHiWm=0.9
>> set cacheLogLevel=error
>> set cachePath=/storage
>> set cacheRam=60
>> set cacheStreams=256
>> set prefetch=0
>> set blkSize=512k
>>
>> all.export /
>> all.role  server
>> oss.localroot $cachePath
>>
>> xrd.port 32294
>> ofs.osslib   libXrdPss.so
>> pss.cachelib libXrdFileCache.so
>>
>> pss.origin $rdtrGlobal:$rdtrGlobalPort
>>
>> pss.config streams 256 workers 16
>> pss.setopt ConnectTimeout 30
>> pss.setopt DebugLevel 3
>> pss.setopt RequestTimeout 30
>>
>> xrootd.seclib /usr/lib64/libXrdSec.so
>>
>> pfc.diskusage $cacheLowWm $cacheHiWm
>> pfc.ram       ${cacheRam}g
>>
>> pfc.blocksize   $blkSize
>> pfc.prefetch    $prefetch
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the XROOTD-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>
