Hi,

I’m forwarding a long email thread about the data catalog (see below) which has a lot of good information in it.  In particular, Stepan’s response to my questions about existing systems at JLAB was very useful to me.

I think further discussion on this topic should be done here on hps-software, since the list of people I am CC’ing for this discussion is getting very long!  And some important people were being left out of the loop.

—Jeremy

Begin forwarded message:

From: Stepan Stepanyan <[log in to unmask]>
Subject: Re: HPS data catalog
Date: April 15, 2014 at 11:44:56 AM PDT
To: "McCormick, Jeremy I." <[log in to unmask]>

Hello Jeremy,

I talked to Sergey today about that. He is thinking of restoring our old
raw data catalog, or database as we call it here.
He will need to do some updates here and there, and some software
licenses may need to be purchased to get our old tools
working. So raw runs/files (including their locations on tape) will be
stored along with their metadata.

As for the reconstructed data, so far we have never had any problem
keeping track of it just by using the tools available for tape
silo and cache disk access. The data processing will use these utilities
anyway, and the scientific computing DB keeps a good
and reliable record of files. As long as the data processing scripts keep
track of processed files, making sure all the files
have been processed and stored on tape, they will never be lost or
have their names changed.

I think this is a good discussion that we should keep alive until
solutions are found. It would help if we could come up with
some scenarios in which the existing systems would fail or get confused.
Different run groups (experiments) have used the existing
system, and I have not heard of anyone having trouble tracking their
data. But of course everything is possible.

Regards, Stepan

On 4/15/14 2:27 PM, McCormick, Jeremy I. wrote:
Hi, Stepan.

Yes, this is extremely useful information.  Thank you for taking the time to put together all these details.

Do you have any opinion on whether this is “good enough” for HPS?  In speaking with John Jaros about this topic, it seems he would prefer that the data management and the data catalog be a JLAB-centric part of the software infrastructure.  I think this makes the most sense here.

My main concern would be that we won’t keep good records of our reconstructed data, such as the LCIO or ROOT files that are derived from the EVIO, or replica files that are mirrored to SLAC.  Any thoughts about that?

I do like the data catalog application here at SLAC, but unfortunately it seems that some code would need to be written for it to be fully functional with our data.  Technically speaking, this is because it would need to do partial file reads of LCIO and EVIO files from the JLAB file system, and an XROOTD server would need to be running at JLAB to allow that remote file access.  So in terms of configuration, running this application from SLAC is far more complicated and probably not worth the headache.  I very much like the feature set of the SLAC application, but I think we can probably get by with something more minimal based at JLAB, even if it doesn’t have the full feature set we’d ideally want.

Any feedback welcome.  Appreciate the input!

—Jeremy

On Apr 14, 2014, at 8:28 PM, Stepan Stepanyan <[log in to unmask]> wrote:

Hello Jeremy,

Here are a few comments; I hope they help:
1) What is the name of the data catalog or data tracking system referred to by Stepan in previous emails that has been used in prior experiments at the lab?  Is there any web accessible documentation for it?  What is the scripting interface and how complete is it?  Will its scripts only work from the jlab domain?  Who maintains and updates this application?
We never really had a "data tracking system"; we have a "data catalog"
system that tracks files from online to tape and fills out a mySQL database.
It has basic info on run number, number of files, number of events (in
each file separately), and metadata like accumulated beam charge, trigger
settings, magnetic fields, and time (gated, ungated, ...). Sergey can
give more detailed info on how it works. It is web accessible.
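
As a rough sketch of the kind of per-run/per-file information being
described (using Python's sqlite3 as a stand-in for the mySQL DB; all
table and column names here are hypothetical, not the actual JLAB schema):

    # Hypothetical sketch of a raw-data catalog schema; sqlite3 stands in
    # for the JLAB mySQL DB, and all names are made up for illustration.
    import sqlite3

    db = sqlite3.connect("raw_data_catalog.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS runs (
        run_number     INTEGER PRIMARY KEY,
        beam_charge    REAL,   -- accumulated beam charge
        trigger_config TEXT,   -- trigger settings
        b_field        REAL,   -- magnetic field setting
        time_gated     REAL,   -- gated time
        time_ungated   REAL    -- ungated time
    );
    CREATE TABLE IF NOT EXISTS files (
        run_number INTEGER REFERENCES runs(run_number),
        file_name  TEXT,
        n_events   INTEGER,    -- number of events in this file
        tape_path  TEXT        -- location on tape
    );
    """)
    db.commit()
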
2) Is this application tied into a public web page through which we can browse the data?  If there is no web frontend, are we okay with this?  Is the data only visible through it when logged into JLAB systems, either directly through database tools or a browser started from a shell session?  If the data catalog is not accessible at all outside JLAB, is this acceptable?
Yes, there is a public page where users can access that information
outside of JLAB (though I tried just now and it looks like the server is
down). There used to be tools on the local on-line computers (I am sure
they still exist for the old 6 GeV data) to extract info from the DB;
some modifications will probably be needed to make them work with new data.
3) What tools have CLAS and other JLAB experiments used in the past to track their data?  Can they be used "out of the box" for HPS?  (related to #1)
I am not sure what you mean by tracking data. After raw files have
been written to tape and have been cataloged, the computer center
(scientific computing) has a catalog of which files were written on which
tapes, volumes, etc. This info is accessible through stub files.
4) Are the tools integrated with the jcache/jput software for getting files from deep storage?  How does this work?
I am not sure what the question is, or what "deep storage" means. In order
to process data, there are a few sets of scripts that run groups have used
to jcache files or run analysis on the batch farm.
5) Does it make any sense to track remote replicas in this system, like ROOT files that have been mirrored to SLAC?  Is the data catalog capable of doing this in terms of having a field that indicates the "site" of the dataset?
Does this tracking need to be a sophisticated system? Would a simple
comparison of the file lists at the original and mirrored sites not be enough?
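
For illustration, such a comparison could be just a few lines of Python;
the listing file names here are hypothetical (e.g. the output of ls or
find generated at each site):

    # Hypothetical sketch: diff plain-text file listings generated at
    # each site to find files that have not yet been mirrored.
    def load_listing(path):
        """Read one file name per line into a set."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    jlab = load_listing("jlab_files.txt")  # listing made at JLAB
    slac = load_listing("slac_files.txt")  # listing made at SLAC
    for name in sorted(jlab - slac):
        print("not yet mirrored to SLAC:", name)
    for name in sorted(slac - jlab):
        print("at SLAC but missing at JLAB:", name)
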
6) What features on slide #2 of my talk does this system have and which of these should be considered crucial/essential and which optional?
Bullet #1: I am not sure why we would want to quickly access or download
files from the web. In my mind this is absolutely impractical. Perhaps I
am not really getting what you are proposing here.
For the rest, I can do everything you have there with existing tools at
JLAB, though the information in the different bullets is not all in the
same catalog.
7) May we register files in the catalog from reconstruction and analysis?  Or is the catalog only something that is supposed to record the locations of the raw data e.g. the EVIO files?
Our catalog was only for raw files, since it was common for all
experiments. Typically experimental groups keep track of analyzed data
using the tools available for browsing the catalog of files on tapes and
cache disks.
If we think that what the file storage and retrieval utilities provide is
not enough, then we should make a new catalog/DB.
-----Original Message-----
From: Stepan Stepanyan [mailto:[log in to unmask]]
Sent: Friday, April 11, 2014 7:23 PM
To: McCormick, Jeremy I.; Francois-Xavier Girod
Cc: Sergey Boiarinov
Subject: Re: xrootd

Jeremy,

Thanks, it is now clearer what you want to do. Sorry I was not at the software meeting.
Our DB does not have most of what you want; it fully covers only your
bullets 3 and 4.

We have to start talking to our computer guys about general HPS needs.
Physics management here is preparing a review of HPS software requests.
We should compile a list of what we need.

Regards, Stepan

On 4/11/14, 8:33 PM, McCormick, Jeremy I. wrote:
Hi, Stepan.

Thanks for the information.

Actually, I don't want to duplicate effort, if possible.

I have attached a PPT of some slides I presented at our last software meeting on the SLAC data catalog.  Slide #2 indicates what I think we should have in terms of core functionality for a data catalog.

I think the advantages of the SLAC application are its complete feature set, good level of support from our computing division here, good documentation, and its accessibility through a public web application.  It will also likely get continual feature improvements going forward.  It has been used successfully for a number of experiments that had data across multiple sites.

The technical disadvantages are the inability to really integrate closely with the JLAB computing systems like jcache, and the fact that it must run at SLAC due to the technical architecture, which was part of the reason I asked about XROOTD.  It is not particularly well suited to experiments where most of the data is not at SLAC, though it can work for this if remote access is provided by the external sites.

I think my specific questions about the JLAB application would start with ...

1) Does it allow setting of arbitrary meta data on files?

2) Does it have a flexible organizational system or is it basically a flat namespace in one table?  Some level of hierarchical organization in order to keep track of data is preferred.

3) Can I register any type of file in it?  Is there a way to register files remotely or only from JLAB?  For instance, we'd like to keep track of DST files that have been mirrored to SLAC in this database.  There are also Monte Carlo LCIO files at SLAC that it would be useful to include in it.

4) Does it have a flexible query interface to get lists of files and their meta data or must one write SQL commands to do this?

5) Does it have a scripting interface for registering new files, adding meta data to existing files, deleting files, etc?  Or is this also done through SQL commands?

6) Can I access the database and/or the webpage from outside the JLAB domain?

7) Can we use it for our Monte Carlo data as well, like for instance the MDC or Test Run 2012 data?

Is there any documentation on the JLAB data catalog system that I can read in order to find out what it does and compare it to the SLAC one?  Is there an example webpage up right now containing data that I could access by running Firefox from a JLAB machine?

I would definitely like to hear more about the JLAB system with a comparison to the SLAC data catalog application.  It is possible that running both of these systems would have advantages: the JLAB data catalog would be tightly integrated with the systems there but more for internal bookkeeping, while the SLAC one would be more publicly accessible and user friendly but not tightly integrated with the data chaining done at JLAB (e.g. raw data -> LCIO -> recon etc.).  Ideally, though, there would be one application to do this task.

As far as just running XROOTD at JLAB, it basically requires minimal effort from the JLAB computing division once it has been set up.  I assume you would want to update it periodically with new versions, but if they would let us, that effort could be handled by someone on HPS.  Of course, it would need some server machine to run on that has access to the file systems where HPS files are stored, and I don't know how that is typically handled at JLAB.  There would also need to be inbound traffic allowed on some port to access the server, and, again, I don't know how that is typically handled.  The bandwidth requirements should be minimal, though that depends on some technical work here first to make sure entire data files are not being sent across the internet from JLAB to SLAC.  The latter is obviously not feasible given the large datasets that will be generated.
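
For reference, standing up such a server mostly comes down to a small
configuration file plus the open port.  A minimal read-only sketch (the
export path here is an assumption; 1094 is the conventional xrootd port):

    # Hypothetical minimal xrootd config for read-only HPS file access.
    all.role server
    all.export /cache/hps r/o   # assumed path to HPS files at JLAB
    xrd.port 1094               # conventional xrootd port

It would then be started on the server machine with something like
"xrootd -c hps.cfg".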

--Jeremy

-----Original Message-----
From: Stepan Stepanyan [mailto:[log in to unmask]]
Sent: Friday, April 11, 2014 4:56 PM
To: McCormick, Jeremy I.; Francois-Xavier Girod
Cc: Sergey Boiarinov
Subject: Re: xrootd

Jeremy,

I think we need to discuss this type of question more thoroughly and have Sergey involved.

The so-called data catalog that you are thinking of does of course exist at JLAB.  We have a mySQL database for every run.  It is web accessible, and being a mySQL database it can be mirrored anywhere you want.  We already have everything that follows the data path from the counting house to the tapes.  If you want to duplicate that effort it is fine, no problem.  But as FX said, we need justification for additional effort from our CC.  In any case I think we should involve Sergey in this conversation (I am cc-ing him on this email).  I may be wrong about what we have and what you need.

Regards, Stepan
On 4/11/14, 7:45 PM, McCormick, Jeremy I. wrote:
Hi,

Thanks for the quick reply.

Here are some of the technical details...

The idea would not be that this is the primary file server for users to directly read data remotely.  All the ROOT DST files will be mirrored to SLAC, so users here will access them directly on site via the file system or offsite through some public protocol like ftp to a SLAC file server.  The files at JLAB will be available either on disk when cached or can be moved to disk from tape storage using the JLAB tape system.

The particular application that I would like to use for a data catalog is based at SLAC and maintained by our computing division.  It has a data crawler which later looks at files that have been registered in it, in order to validate them and automatically set some metadata such as run numbers and the number of events in the file.  The way it does this efficiently on remote files involves reading, via an XROOTD server at the remote site (in this case JLAB), only the small portion of the registered data file which contains this information.  For ROOT files, this is contained in one of the first header blocks of the file.  For LCIO files, there is also a single data block that can be read which gives the number of events and the run numbers.  So the XROOTD server would basically function as a bridge between the SLAC-based data catalog application and the data resident at JLAB; it is not intended in this case for serving entire datasets across the internet from site to site (though it can do that as well and is commonly used for that purpose).
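
To make the partial-read idea concrete, here is a minimal sketch using the XRootD Python bindings; the host, path, and read size are assumptions, and a real crawler would parse further into the header for the run number and event count:

    # Hypothetical sketch of a partial remote read over XROOTD.
    from XRootD import client

    f = client.File()
    status, _ = f.open("root://xrootd.jlab.org//cache/hps/run0123.root")
    if not status.ok:
        raise IOError(status.message)

    # Read only the first 16 kB, where the ROOT file header lives,
    # rather than pulling the whole file across the network.
    status, header = f.read(offset=0, size=16 * 1024)
    f.close()

    # ROOT files begin with the magic bytes "root"; the crawler would
    # go on to parse header blocks for the run number and event count.
    print("looks like a ROOT file:", header[:4] == b"root")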

I hope that makes sense.  I realize there are technical and reliability reasons why it might make more sense to run a data catalog app at JLAB, but I have not found or been informed of any software that will allow us to do this, at least nothing that is well-suited to HPS and has all the features one really wants from a data catalog application.  In terms of feature set, the SLAC SRS catalog is quite good and has been used for a number of experiments here that have data across multiple sites, so it could be used if we can overcome a few technical hurdles (the above being the primary one).

Another question having to do with this data catalog application...

How long will data be present on the cache disks before it is written to tape and then deleted?  The crawler would need to access these files before they are put to tape, though there are ways to recover the information later if it doesn't find the files present, e.g. by re-caching them to disk and telling the crawler to make another pass.  It would be best if they could be caught by the application while on disk for the first time, though, so I'm wondering in general how long they will reside on disk before being archived to tape.

--Jeremy

-----Original Message-----
From: Francois-Xavier Girod [mailto:[log in to unmask]]
Sent: Friday, April 11, 2014 4:19 PM
To: McCormick, Jeremy I.
Cc: Stepan Stepanyan
Subject: Re: xrootd


Dear Jeremy,

It should be technically possible, but the question is whether the computer center will agree to set it up and maintain it.  Can you prepare a short description justifying the application and the necessary resources to submit to them?  What is the timescale you have in mind for organizing this?

What we have done in the past is to (1) run the event reconstruction on the large raw data here at JLab (cooking) and (2) transfer the smaller "cooked" dataset to the other lab, usually via bbftp (which is supported by the CC at JLab).  It makes sense if people at the other lab want to access the entire cooked dataset more than once.  The calibration databases can be duplicated at the remote site for better performance.
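
As a rough illustration of step (2), the transfer could be scripted like this; the bbftp flags, account, and hostname are assumptions to be checked against the bbftp setup supported by the CC:

    # Hypothetical sketch: push one cooked file to SLAC with bbftp.
    # The -u/-e flags and the endpoint are assumptions; verify against
    # the bbftp version installed at JLab.
    import subprocess

    subprocess.run(
        ["bbftp",
         "-u", "hps",  # remote account (hypothetical)
         "-e", "put run0123_cooked.root /hps/dst/run0123_cooked.root",
         "bbftp.slac.stanford.edu"],  # hypothetical endpoint
        check=True,
    )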

Best regards,
FX

----- Original Message -----
From: "Jeremy I. McCormick" <[log in to unmask]>
To: [log in to unmask]
Cc: "Stepan Stepanyan" <[log in to unmask]>
Sent: Friday, April 11, 2014 6:56:51 PM
Subject: xrootd

Hi,

I have a technical question regarding HPS and JLAB computing.

Do you think it would be possible for HPS to run an XROOTD server at JLAB for purposes of remote file access?

This is a standard application that is run at many HEP labs including SLAC and FNAL.

http://xrootd.org/

I am asking this particular question because I would like to use a SLAC data catalog application for our experiment (still up in the air as to whether we will or not), but it needs to be able to look at the files remotely in order to function effectively.  The best and most secure way to do this is using a connection to an XROOTD server, so I basically wanted to find out if that would be possible.

Specifically, XROOTD would allow the catalog server to read small portions of files across the internet in order to extract information from them without requiring that the whole file be read remotely.

Who would we talk to in order to find out if this is possible or not?

I have technical contacts here (the authors of XROOTD actually) who can provide more details if there are security concerns or other questions from the JLAB computing division.

--Jeremy





