------ Forwarded Message
From: Michael Ernst <[log in to unmask]>
Date: Thu, 05 Mar 2009 18:09:30 -0500
To: Rob Gardner <[log in to unmask]>, "McKee, Shawn" <[log in to unmask]>, Saul Youssef <[log in to unmask]>, Wei Yang <[log in to unmask]>
Subject: FW: re-processing campaign has started

All,

this is to let you know that the re-processing exercise has started. This morning validation jobs were submitted to the US cloud and finished successfully at AGLT2. Please find more information from Kors below.

-- Michael

------ Forwarded Message
From: Kors Bos <[log in to unmask]>
Date: Thu, 5 Mar 2009 17:47:31 -0500
To: <[log in to unmask]>
Conversation: re-processing campaign has started
Subject: re-processing campaign has started

All,

the validation tasks for the re-processing have been submitted or are about to be submitted. These are very much like the tasks that were run during the Christmas re-processing campaign. Last time these tests revealed a problem at FZK, but that is now understood and fixed, so we don't expect any site to fail this time. However, Taipei is down because of the fire and won't be back up in time to participate; those data now have to be re-processed at CERN.

Tests have been performed at PIC and RAL to re-process data from tape. These still use the old release of the reconstruction software, but that is not important for this test. At PIC this went very well, except that the task didn't finish because we seem to have a broken tape. It is good that this happened now, because it gives us the opportunity to test how to fix it; we can be sure this will happen again. At RAL this test is also going well, although a bit more slowly (on purpose). All cosmics data has been cleaned from the buffers so the same tests can start at the other T1's. Lyon will bring the files on-line manually because one component is still missing for pre-staging to be done by the site services.
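[Not part of the original message: a minimal sketch of the pre-staging and buffer turn-over idea mentioned above. The class and method names are invented for illustration; the real site services and stage buffer work very differently.]

```python
from collections import deque

class StageBuffer:
    """Toy model of a disk buffer in front of tape (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity   # max number of files held on the buffer
        self.on_disk = deque()     # files currently staged, oldest first

    def pre_stage(self, filename):
        """Bring a tape-resident file on-line before any job asks for it."""
        if filename in self.on_disk:
            return                 # already on-line, nothing to do
        if len(self.on_disk) >= self.capacity:
            self.on_disk.popleft() # buffer turn-over: evict the oldest file
        self.on_disk.append(filename)

    def is_online(self, filename):
        return filename in self.on_disk


# The site services would pre-stage ahead of the jobs; with too small a
# buffer, early files are turned over before their jobs run.
buf = StageBuffer(capacity=2)
for f in ["run1.RAW", "run2.RAW", "run3.RAW"]:
    buf.pre_stage(f)

print(buf.is_online("run1.RAW"))   # → False (evicted by turn-over)
print(buf.is_online("run3.RAW"))   # → True
```

The point of the validation tasks is exactly to check that the capacity and eviction rate of the real buffer are matched to the job arrival rate, so that files are still on-line when their jobs start.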
These tests tell us whether pre-staging works and whether the buffer turn-over is more or less optimal for the jobs.

The plan is still to start the real re-processing of all the cosmics data next week. We know that there is a few-day shutdown at FZK, so they will probably start a little later. We don't have to remove the RAW data from the disks, because Panda can now distinguish between the copy on disk and the copy on tape and can be made to choose the tape copy. This gives us a fall-back in case we do have problems reading from tape. We should hope not to need this fall-back, because with real data we won't have an extra copy on disk.

We will have two measures against the "hot file" problem. First, there will be one conditions-data tar ball per run, rather than a single one for all 100 runs together as we had over Christmas, so fewer jobs will be trying to access these data at the same time. Secondly, at a few sites we will test the "pcache solution", where the conditions data are left on the worker node after a job has finished; if the next job on that node needs the same data, it simply uses the local copy instead of bringing in a fresh one.

During Christmas and New Year few people were available at the sites. This time we hope you will monitor this effort closely and report any irregularity to us. We need to measure how efficiently we can do re-processing, how many CPUs we use, and how long it takes. We need to know whether the stage buffer matches the number of tape drives filling it and the number of CPUs using it. And then there are the exceptions: broken tapes, files that seem to be missing for other reasons, job crashes, and so on. This may be one of the last chances to test before we need it all working for real.

Kors

------ End of Forwarded Message

------ End of Forwarded Message
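[Not part of the original message: a sketch of the job-side behaviour of the "pcache solution" described above. This is not the actual pcache implementation; the function, file names, and cache layout are invented for the illustration.]

```python
import os
import tempfile

def fetch_conditions(run, cache_dir, fetch_from_storage):
    """pcache-style lookup: reuse a conditions tar ball already left on the
    worker node by a previous job, otherwise fetch it once and leave it
    behind for the next job on this node."""
    local = os.path.join(cache_dir, f"conditions_run{run}.tar.gz")
    if os.path.exists(local):
        return local, "cache hit"      # next job on this node reuses it
    fetch_from_storage(local)          # expensive copy from the storage element
    return local, "cache miss"


# Usage: the first job on the node pays the copy, the second one does not.
cache = tempfile.mkdtemp()

def dummy_fetch(dest):                 # stand-in for the real transfer tool
    with open(dest, "wb") as f:
        f.write(b"conditions payload")

print(fetch_conditions(91900, cache, dummy_fetch)[1])   # → cache miss
print(fetch_conditions(91900, cache, dummy_fetch)[1])   # → cache hit
```

Together with the one-tar-ball-per-run change, this reduces the number of simultaneous reads of the same "hot" conditions files from the storage element.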