On Feb 14, 2011, at 2:55 PM, Andrew Hanushevsky wrote:
> Hi Brian,
>
> That's a tough one. The first look is to see who created fd's that point to /dev/null. There is only three places where that's used in the base xrootd:
>
> 1) In connection management when a socket is being closed but associated objects still have references. But this code is only used by the cmsd, so I can't see this being your problem.
>
> 2) Posix plug-in where virtual file descriptors have real fd's pointing to /dev/null backups. I doubt this is your problem unless you are using XrdPosix in some plug-in.
>
> 3) In backgrounding where /dev/null is substituted for stdxxx fd's. I doubt this is your problem.
>
We found the issue - it was along these lines. The LCMAPS callout library has a plugin which uses a "noisy" library that has a tendency to spew all over std{err,out}. It created a copy of the file descriptors, redirected std{err,out} to /dev/null, did operations, and restored std{err,out} using dup2. However, it failed to close the copies, and leaked 2 fds every new login. Ouch.
Thanks!
Brian
> So, this would seem that one of the plug-ins is creating fd's to /dev/null then leaving it behind. I suspect that the plug-in is using the same trick the connection manager uses when it has to close an fd but there is still a reference outstanding (i.e., it merely substitutes a new /dev/null fd for the one being closed and waits for the references to go to zero).
>
> The other thing to look for is whether there are a lot of threads hanging on some resource; usually a sign of misbehavior.
>
> To find who creates such fd's you can attach strace to the server looking for open() and dup2 calls.
>
> Andy
>
> ----- Original Message -----
> From: Brian Bockelman
> To: Andrew Hanushevsky
> Cc: xrootd-dev ; Matevz Tadel
> Sent: Monday, February 14, 2011 11:32 AM
> Subject: Re: xrootd -- too many files open on uaf-3
>
> Hi Andy,
>
> What should I look for in the code? For Matevz, there's three plugins loaded, so we'll have to play detective.
>
> Brian
>
> On Feb 14, 2011, at 1:30 PM, Andrew Hanushevsky wrote:
>
>> Hi Brian,
>>
>> Usually it's a misbehaving plug-in. In general, we cannot tell who is leaking file descriptors but I doubt is the logging system as it's been in production for a decade and hasn't leaked any yet. A core file will likely be helpful when this happens.
>>
>> Andy
>> ----- Original Message -----
>> From: Brian Bockelman
>> To: xrootd-dev ; Matevz Tadel
>> Sent: Monday, February 14, 2011 4:31 AM
>> Subject: Fwd: xrootd -- too many files open on uaf-3
>>
>> Hey folks,
>>
>> Any idea what would cause xrootd to leak file descriptors?
>>
>> This is a version from "git master" about 2 months older than 3.0.0. We'll try updating today, but I'd be interested if anyone knew anything off the top of their head.
>>
>> Brian
>>
>> Begin forwarded message:
>>
>>> From: Matevz Tadel <[log in to unmask]>
>>> Date: February 13, 2011 12:38:19 PM CST
>>> To: Brian Bockelman <[log in to unmask]>
>>> Subject: xrootd -- too many files open on uaf-3
>>>
>>> Hi Brian,
>>>
>>> In Nagios I saw uaf-3 went down ... and it turns out that xrootd had problem with too many open files (ulimit is 2048).
>>>
>>> The funny thing is, it had an old log file opened 500-times, see the log fragments below.
>>>
>>> It all works ok after restart, thought I'd let you know.
>>>
>>> Best,
>>> Matevz
>>>
>>>
>>>> From lsof:
>>> ----------
>>> <many fds before>
>>> xrootd 7770 xrootd 992w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 993u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 994w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 995u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 996w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 997u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 998w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 999u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1000w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1001u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1002w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1003u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1004w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1005u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1006w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1007u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1008w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1009u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1010w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1011u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1012w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1013u CHR 1,3 2594 /dev/null
>>> xrootd 7770 xrootd 1014w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
>>> xrootd 7770 xrootd 1015u CHR 1,3 2594 /dev/null
>>> <many fds after>
>>>
>>>
>>> xrootd.log:
>>> -----------
>>> 110213 00:59:31 7770 nagios.26155:[log in to unmask] XrootdProtocol: more auth requested; sz=2066
>>> 110213 00:59:31 7770 XrootdXeq: nagios.26155:[log in to unmask] login as 92a2e9e2.0
>>> 110213 00:59:31 7770 acc_Audit: nagios.26155:[log in to unmask] grant gsi [log in to unmask] rea
>>> d /store/test/xrootd/T2_US_UCSD/store/mc/JobRobot/RelValProdTTbar/GEN-SIM-DIGI-RECO/MC_3XY_V24_JobRobot
>>> -v1/0001/56E18353-982C-DF11-B217-00304879FA4A.root
>>> java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/hadoop-default.xml (Too many ope
>>> n files)
>>> at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1043)
>>> at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:940)
>>> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:891)
>>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:345)
>>> at org.apache.hadoop.security.UnixUserGroupInformation.saveToConf(UnixUserGroupInformation.java
>>> :182)
>>> <etc ....>
>>>
>>>
>>> xrootd hadoop.log:
>>> ------------------
>>> 2011-02-13 01:49:31,816 FATAL conf.Configuration (Configuration.java:loadResource(1042)) - error parsing conf file: java.io.FileNotFoundException: /etc/hadoop/hadoop-default.xml (Too many open files)
>>> <message repeated a million times>
>>
>>
>
>
|