Pre-compilation works with the 5.26-proof version. My AFS filesystem was read-only and the save didn't go through. One problem solved.

More observations (keep in mind all rates have ~10% uncertainty from run to run, maybe related to T2 storage load):

1. The GUI is screwy: the GUI progress bar makes no sense, even when the job seems to be running. The elapsed-processing-time clock only increments one second every 10 seconds or so, which makes the events/sec rates about 10X bigger than they actually are. The number of events processed is also off: the GUI said the job had processed ~50K events while each of the workers had processed between 30K and 40K events. I think the X11 forwarding from SLAC might not have the bandwidth to keep this updated in real time (though the PROOF-Lite GUI on atlint01 works fine forwarded from SLAC...maybe it's the speedometer gizmo?), so I'm going to try batch mode to see if that helps.

Update: Batch mode fixes this; the rates update in real time and make sense, and PROOF obeys stop commands now. Stay away from the GUI (at least remotely). A minimal batch-mode sketch is included after point 7 below.

2. Speed scales linearly with the number of workers for these tests (running off T2 xrootd). The ntuple used is around 90 kB/evt and the analysis writes a lot of histograms. With all 36 workers: 1550 evts/sec (~43 evts/sec/worker). Pretty slow.

3. Turning off most histograms gives about a 3-6% speed improvement (1625 evts/sec, 45 evts/sec/worker)--->negligible.

4. Turning all MC truth calculations off but still running on the same dataset: 1660 evts/sec, 46 evts/sec/worker.

5. Turning MC branches off but still running on the same dataset: 2075 evts/sec, 58 evts/sec/worker.

6. Use data instead of MC (no truth information and thus a smaller ntuple, ~83 kB/evt): 3471 evts/sec, 96 evts/sec/worker.

7. Same as #6 but run from the proof cluster storage. Says it can't find the files. Is root://boer0123//atlas/proof/bcbutler/user10.ZacharyMarshall.data10_7TeV.00152409.physics_MinBias.recon.ESD.f238.V1.2010.04.12_AANT/*.root* not correct?

Exact error:
Srv err: Unable to open directory /atlas/proof/bcbutler/user10.ZacharyMarshall.data10_7TeV.00152409.physics_MinBias.recon.ESD.f238.V1.2010.04.12_AANT/
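
For reference, a batch-mode run boils down to something like the following macro. This is only a sketch: the tree name "CollectionTree", the selector "MySelector.C", and the file paths are placeholders, not the actual names used in this analysis.

  // run_test.C -- minimal batch-mode PROOF sketch; names and paths are made up
  void run_test()
  {
     TProof *p = TProof::Open("boer0123");             // connect to the master
     TDSet *ds = new TDSet("TTree", "CollectionTree"); // object type + tree name
     // T2 xrootd URL here; swap in root://boer0123//atlas/proof/... for test #7
     ds->Add("root://SOME-T2-DOOR//atlas/some/path/ntuple._00001.root");
     p->Process(ds, "MySelector.C+");                  // '+' compiles with ACLiC
  }

Started with "root -b -l -q run_test.C", so the GUI never comes up.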

I'm not sure how helpful this is for deciding machinery other than confirming things we already knew...it's clear with this big ntuple (and this particular analysis) that the cluster's processing speed is input-limited rather than CPU-limited, so maybe we should focus on more CPUs rather than faster CPUs? Small changes in ntuple size result in massive speed increases, though the picture is muddied a bit by the additional complication of turning branches on and off (which is clearly not a negligible effect, but smaller than slimming).

I'm not sure exactly what turning a branch off does when reading from a network drive...is the entire tree transferred and then only the enabled branches loaded into memory? If so, that would suggest local disk speed is worth investing in, given that turning the truth branches off gave a 25% increase from test 4 to test 5.
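
For concreteness, the branch switching in tests 4 and 5 amounts to something like this in the selector's Init(); the selector name and the "mc_*" branch pattern are made up. Only the enabled branches have their baskets read from the file, so in principle the disabled truth data should not need to be transferred at all.

  // sketch only; "MySelector" and the "mc_*" pattern are placeholders
  void MySelector::Init(TTree *tree)
  {
     fChain = tree;
     fChain->SetBranchStatus("*", 1);     // enable everything by default
     fChain->SetBranchStatus("mc_*", 0);  // then switch off the truth block
  }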

The biggest gain in speed was obtained from using a smaller ntuple (no truth) but the increase in speed seems very large compared to the actual size difference of the ntuples per event...I'm not sure how to interpret this.

-Bart

Yang, Wei wrote:
[log in to unmask]" type="cite">
I just sudo'd to bcbutler, ran root, and killed all your sessions:

root [1] TProof::Reset("boer0123",1);
 
| Message from server:
| CleanupSessions: hard-reset: signalling active sessions for termination
 
| Message from server:
| CleanupSessions: hard-reset: cleaning up client: requested by: bcbutler.7796:35@atl-prod05
| CleanupSessions: hard-reset: forwarding the reset request to next tier(s) 
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler
| Send: cleanup request to [log in to unmask]:1093 for user: bcbutler

I actually ran it twice. The first time it seems to have killed all sessions on the workers but left the one on the master behind. The second time it killed the one on the master as well.

regards,
Wei Yang  |  [log in to unmask]  |  650-926-3338(O)


On Jun 10, 2010, at 2:51 AM, Bart Butler wrote:

Another thing that has consistently been a problem with the cluster is the seeming inability to kill a job. I hit Cancel/Stop on the GUI and waited about 10 minutes. That did nothing, so I hit Ctrl-C, which got me this:

Enter A/a to switch asynchronous, S/s to stop, Q/q to quit, any other key to continue: s
Info in <TSignalHandler::Notify>: Processing interrupt signal ... s

Enter A/a to switch asynchronous, S/s to stop, Q/q to quit, any other key to continue: s
Info in <TSignalHandler::Notify>: Processing interrupt signal ... s

Info in <TMonitor::Select>: *** interrupt occured ***

Enter A/a to switch asynchronous, S/s to stop, Q/q to quit, any other key to continue: s
Info in <TSignalHandler::Notify>: Processing interrupt signal ... s
Info in <TMonitor::Select>: *** interrupt occured ***

Selecting q for quit doesn't seem to do anything either. Eventually I had to log into atlint01 again and kill the root process manually. On a subsequent attempt, I was able to reconnect successfully, but my previous session was still running (or crashed) and it just reconnected to that (stalled) session. I'm trying to kill that session again now so I can run a longer test.

Short version: there is an issue with killing jobs cleanly that seems to leave the cluster in an unusable state (at least until some internal timeout completes?), but the cluster does run using all workers (from the T2 storage at the moment; a longer T2 vs. cluster-storage performance test is in the works, assuming this session ever dies) provided the shared library is pre-compiled. I saw the same behavior in local sessions and back in May, so it is not really surprising that it's an issue with the cluster too, considering my code to make the package hasn't changed much. Figuring out on-the-fly compilation on the workers would be nice down the road; getting the ability to kill jobs cleanly is necessary now, though.
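
For reference, loading the pre-compiled package comes down to the standard PROOF package calls, roughly like this; the package name "AnalysisPkg" is just an example, not the real one:

  TProof *p = TProof::Open("boer0123");
  p->UploadPackage("AnalysisPkg.par");   // ship the PAR tarball to master and workers
  p->EnablePackage("AnalysisPkg");       // unpack and run PROOF-INF/SETUP.C on each node
  p->ShowEnabledPackages();              // sanity check before calling Process()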

-Bart

Yang, Wei wrote:
Hi Bart,

I made another attempt. Here is what I used to start on the client side (assuming bash) on a RHEL5 64-bit machine:

. /afs/slac/g/atlas/packages/gcc432/setup.sh
export ROOTSYS=/afs/slac.stanford.edu/g/atlas/packages/root/root5.26.00b-slc5_amd64-gcc43
export PATH=${PATH}:$ROOTSYS/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$ROOTSYS/lib
$ROOTSYS/bin/root

It seems I was able to load a .par file. Can you give it a try? Also, remember on atlint01, if you copy a file to /xrootd/proof/bcbutler, you should TDSet::Add("root://boer0123//atlas/proof/bcbutler/..."). However, I found that reading from T2 storage seems to be faster than reading from the disks in the proof cluster (without the localizer).
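
A quick way to double-check that a file copied there is really readable through the xrootd door is something like the following (the file name below is just an example):

  TFile *f = TFile::Open("root://boer0123//atlas/proof/bcbutler/test.AANT._00001.root");
  if (!f || f->IsZombie()) printf("cannot open the file through boer0123\n");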

regards,
Wei Yang  |  [log in to unmask]  |  650-926-3338(O)


On Apr 29, 2010, at 3:25 PM, Bart Butler wrote:

First things first: I think I killed your cluster. The xrootd mount is no longer readable from atlint01 and I can't submit PROOF jobs to it anymore. This happened after killing my client root session manually after a massively screwed-up job.

Secondly, I am having a hell of a time compiling my shared library correctly. Which version of ROOT is the cluster running? If I'm not running the exact same ROOT version and gcc version as every worker node, I can't make binaries (which is what Booker did with his test package, it seems; I do it too when I run PROOF-Lite). And if I can't make binaries, I have to submit source packages. This should be fine, but it has never worked well for me. My first theory was that because the packages are kept in a common place on xrootd in my user space, the compilation errors I was getting from some workers were caused by all 32 workers (I was never able to connect to 4 of the 36) trying to compile the package at the same time in the same place. Running on a single worker worked fine (but of course was slow). I don't think this compilation issue was the whole story, though, because if the single-worker run worked, the next time all workers should have been able to load the compiled version without problems, assuming they are all running the same version of ROOT, and they crashed and burned just as badly that time. That's when the cluster itself crashed.

Another thing was that making TDSets from the Tier 2 xrootd storage worked fine, but when I tried using the same files I had copied to the cluster xrootd storage it couldn't find them for some reason.

My log files should be in /xrootd/proof/bcbutler if you guys get the cluster working again.

-Bart


Yang, Wei wrote:
Hi Bart, David,

any news on this?

regards,
Wei Yang  |  [log in to unmask]  |  650-926-3338(O)


On Apr 21, 2010, at 12:03 PM, Bart Butler wrote:

I'll try to run a few jobs tonight and see what happens.

-Bart

Yang, Wei wrote:
[add Andy Hass ...]

Hi David, Booker,

I mounted the xrootd space of the proof cluster at /xrootd/proof on atlint01.  It looks like we have ~1.8TB total on the cluster. So something ~ 1TB should work.

The cluster should be able to access T2 storage if you provide the URLs of the root files to process. But the whole idea of using PROOF is to avoid network traffic as much as possible. As we are still validating the functionality, it would be good to try both. Or you could put half of the data on the proof cluster and leave the other half on T2 storage (no NFS please).

The proof master node is boer0123. If you copy files to the cluster, the xroot URL is root://boer0123//atlas/proof (I suggest you create a fizisist sub-dir).
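
For example (file and sub-directory names below are only illustrative), you could stage a file and then point the job at it like this:

  TFile::Cp("myntuple.root", "root://boer0123//atlas/proof/fizisist/myntuple.root");
  // ... and later in the job:
  // dset->Add("root://boer0123//atlas/proof/fizisist/myntuple.root");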

Booker, it looks like proof also leaves some files on the cluster. How would you suggest we manage the space: by user, by group, or something else?

regards,
Wei Yang  |  [log in to unmask]  |  650-926-3338(O)


On Apr 21, 2010, at 8:40 AM, David W. Miller wrote:

Hi Booker and Wei,

I have a few questions: from what machine do we launch the jobs? Any machine at SLAC, as long as we specify the URI correctly? Also, if the data are on atlasuserdisk or usr in /xrootd/atlas/, is that sufficient?

Thanks,
David

On Apr 21, 2010, at 17:36 PM, Ariel Schwartzman wrote:

From: Booker Bense <[log in to unmask]>
Date: April 21, 2010 16:09:51 GMT+02:00
To: "Schwartzman, Ariel G." <[log in to unmask]>
Cc: "Yang, Wei" <[log in to unmask]>
Subject: Re: Proof cluster ready for testing


On Wed, 21 Apr 2010, Ariel Schwartzman wrote:

Hi Booker,

I cannot access this machine remotely:

ssh -Y boer0123.slac.stanford.edu
ssh: connect to host boer0123.slac.stanford.edu port 22: Operation timed out

It's on the SLAC internal network; you'll need to log in to a SLAC machine and run root programs from there. You shouldn't need login access to the master node.

_ Booker C. Bense


==========================================
David W. Miller
------------------------------------------
SLAC
Stanford University
Department of Physics

SLAC Info: Building 84, B-156. Tel: +1.650.926.3730
CERN Info: Building 01, 1-041. Tel: +41.76.487.2484

EMAIL:    [log in to unmask]
HOMEPAGE: http://cern.ch/David.W.Miller
==========================================