Hi Andy, Brian,
So, I have a couple of xrootds hanging with this problem now :)
The last three log files are still kept open.
On 02/14/11 21:55, Andrew Hanushevsky wrote:
> Hi Brian,
>
> That's a tough one. The first look is to see who created fd's that point to /dev/null. There is only three places where that's used in the base xrootd:
>
> 1) In connection management when a socket is being closed but associated objects still have references. But this code is only used by the cmsd, so I can't see this being your problem.
>
> 2) Posix plug-in where virtual file descriptors have real fd's pointing to /dev/null backups. I doubt this is your problem unless you are using XrdPosix in some plug-in.
>
> 3) In backgrounding where /dev/null is substituted for stdxxx fd's. I doubt this is your problem.
>
> So, this would seem that one of the plug-ins is creating fd's to /dev/null then leaving it behind. I suspect that the plug-in is using the same trick the connection manager uses when it has to close an fd but there is still a reference outstanding (i.e., it merely substitutes a new /dev/null fd for the one being closed and waits for the references to go to zero).
Brian, is hadoop plugin using this?
> The other thing to look for is whether there are a lot of threads hanging on some resource; usually a sign of misbehavior.
(gdb) info thr
32 Thread 0x41101940 (LWP 24209) 0x000000351260b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
31 Thread 0x41202940 (LWP 24210) 0x000000351260b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
30 Thread 0x41303940 (LWP 24211) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
29 Thread 0x41404940 (LWP 24212) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
28 Thread 0x41fc1940 (LWP 24213) 0x0000003511ad4358 in epoll_wait () from /lib64/libc.so.6
27 Thread 0x420c2940 (LWP 24214) 0x0000003511ad4358 in epoll_wait () from /lib64/libc.so.6
26 Thread 0x421c3940 (LWP 24215) 0x0000003511ad4358 in epoll_wait () from /lib64/libc.so.6
25 Thread 0x422c4940 (LWP 24218) 0x000000351260e1c1 in nanosleep () from /lib64/libpthread.so.0
24 Thread 0x423c5940 (LWP 24219) 0x000000351260d91b in read () from /lib64/libpthread.so.0
23 Thread 0x424c6940 (LWP 24220) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
22 Thread 0x408a1940 (LWP 24221) 0x000000351260dc81 in recv () from /lib64/libpthread.so.0
21 Thread 0x41ae0940 (LWP 24225) 0x000000351260db3b in accept () from /lib64/libpthread.so.0
20 Thread 0x41695940 (LWP 24273) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
19 Thread 0x41796940 (LWP 24274) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
18 Thread 0x41897940 (LWP 24275) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
17 Thread 0x41998940 (LWP 24276) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
16 Thread 0x425c7940 (LWP 24277) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
15 Thread 0x426c8940 (LWP 24278) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
14 Thread 0x427c9940 (LWP 24279) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
13 Thread 0x428ca940 (LWP 24280) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
12 Thread 0x41be1940 (LWP 24281) 0x000000351260b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
11 Thread 0x41ce2940 (LWP 24282) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
10 Thread 0x41de3940 (LWP 24283) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
9 Thread 0x429cb940 (LWP 24284) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
8 Thread 0x42acc940 (LWP 24285) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
7 Thread 0x42bcd940 (LWP 24286) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x42cce940 (LWP 24287) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x42dcf940 (LWP 24288) 0x000000351260aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x42ed0940 (LWP 24289) 0x000000351260b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x42fd1940 (LWP 26191) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
2 Thread 0x40cd2940 (LWP 10329) 0x000000351260cd01 in sem_wait () from /lib64/libpthread.so.0
* 1 Thread 0x2b47bef62e20 (LWP 24208) 0x000000351260db3b in accept () from /lib64/libpthread.so.0
Does this seem "normal"? :)
I looked at a couple of them and they are mostly java threads, waiting in various places.
> To find who creates such fd's you can attach strace to the server looking for open() and dup2 calls.
I'll try that now on one that is still alive.
Cheers,
Matevz
> ----- Original Message -----
> From: Brian Bockelman
> To: Andrew Hanushevsky
> Cc: xrootd-dev ; Matevz Tadel
> Sent: Monday, February 14, 2011 11:32 AM
> Subject: Re: xrootd -- too many files open on uaf-3
>
>
> Hi Andy,
>
>
> What should I look for in the code? For Matevz, there's three plugins loaded, so we'll have to play detective.
>
>
> Brian
>
>
> On Feb 14, 2011, at 1:30 PM, Andrew Hanushevsky wrote:
>
>
> Hi Brian,
>
> Usually it's a misbehaving plug-in. In general, we cannot tell who is leaking file descriptors but I doubt is the logging system as it's been in production for a decade and hasn't leaked any yet. A core file will likely be helpful when this happens.
>
> Andy
> ----- Original Message -----
> From: Brian Bockelman
> To: xrootd-dev ; Matevz Tadel
> Sent: Monday, February 14, 2011 4:31 AM
> Subject: Fwd: xrootd -- too many files open on uaf-3
>
>
> Hey folks,
>
>
> Any idea what would cause xrootd to leak file descriptors?
>
>
> This is a version from "git master" about 2 months older than 3.0.0. We'll try updating today, but I'd be interested if anyone knew anything off the top of their head.
>
>
> Brian
>
>
>
> Begin forwarded message:
>
>
> From: Matevz Tadel <[log in to unmask]>
>
> Date: February 13, 2011 12:38:19 PM CST
>
> To: Brian Bockelman <[log in to unmask]>
>
> Subject: xrootd -- too many files open on uaf-3
>
>
>
> Hi Brian,
>
> In Nagios I saw uaf-3 went down ... and it turns out that xrootd had problem with too many open files (ulimit is 2048).
>
> The funny thing is, it had an old log file opened 500-times, see the log fragments below.
>
> It all works ok after restart, thought I'd let you know.
>
> Best,
> Matevz
>
>
>
> From lsof:
>
> ----------
> <many fds before>
> xrootd 7770 xrootd 992w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 993u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 994w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 995u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 996w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 997u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 998w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 999u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1000w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1001u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1002w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1003u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1004w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1005u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1006w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1007u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1008w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1009u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1010w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1011u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1012w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1013u CHR 1,3 2594 /dev/null
> xrootd 7770 xrootd 1014w REG 9,0 341353 3303138 /var/log/xrootd/xrootd.log.20110211
> xrootd 7770 xrootd 1015u CHR 1,3 2594 /dev/null
> <many fds after>
>
>
> xrootd.log:
> -----------
> 110213 00:59:31 7770 nagios.26155:[log in to unmask] XrootdProtocol: more auth requested; sz=2066
> 110213 00:59:31 7770 XrootdXeq: nagios.26155:[log in to unmask] login as 92a2e9e2.0
> 110213 00:59:31 7770 acc_Audit: nagios.26155:[log in to unmask] grant gsi [log in to unmask] rea
> d /store/test/xrootd/T2_US_UCSD/store/mc/JobRobot/RelValProdTTbar/GEN-SIM-DIGI-RECO/MC_3XY_V24_JobRobot
> -v1/0001/56E18353-982C-DF11-B217-00304879FA4A.root
> java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/hadoop-default.xml (Too many ope
> n files)
> at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1043)
> at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:940)
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:891)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:345)
> at org.apache.hadoop.security.UnixUserGroupInformation.saveToConf(UnixUserGroupInformation.java
> :182)
> <etc ....>
>
>
> xrootd hadoop.log:
> ------------------
> 2011-02-13 01:49:31,816 FATAL conf.Configuration (Configuration.java:loadResource(1042)) - error parsing conf file: java.io.FileNotFoundException: /etc/hadoop/hadoop-default.xml (Too many open files)
> <message repeated a million times>
>
>
>
>
>
>
>
|