Hi everyone,

 

We’ve temporarily removed IN2P3 from the list of destination sites until we can resolve the ImSim distribution problem, so that other sites can continue to process DC2 jobs.

 

Some of us have a telecon on Friday 21st, so I suggest we add this to the agenda for that,

George.

 

 

From: [log in to unmask] <[log in to unmask]> On Behalf Of Alessandra Forti
Sent: 07 June 2019 16:36
To: Fabio Hernandez <[log in to unmask]>; PERRY James <[log in to unmask]>
Cc: Dominique Boutigny <[log in to unmask]>; LSST-DESC-GRID <[log in to unmask]>
Subject: Re: problems with LSST software tarball

 

Hi Fabio,

I don't understand what the problem is with using the gridpp CVMFS repository. I think it is available everywhere in EGI, as it should be part of the EGI CVMFS configuration RPM. I can certainly also see it from lxplus (CERN):

aforti@lxplus783> ls /cvmfs/gridpp.egi.eu/lsst
sims_2_8_0  sims_2_9_0  sims_w_2019_10_1
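
For reference, a minimal sketch of how a job could pick one of those up, assuming each version directory is a standard eups-based stack install with the usual loadLSST.bash entry point (I have not verified the exact layout):

# job-side setup sketch; path layout and entry point are assumptions
STACK=/cvmfs/gridpp.egi.eu/lsst/sims_w_2019_10_1
if [ -d "$STACK" ]; then
    source "$STACK/loadLSST.bash"   # assumed location of the stack entry point
    setup lsst_sims                 # eups setup for the sims metapackage
fi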

cheers
alessandra

On 07/06/2019 16:06, Fabio Hernandez wrote:

James,

 

I propose we explore storing the imSim tarball in DIRAC and making several replicas, including one at a storage element at CC-IN2P3.

 

Do you think that would be compatible with the mechanism that you use to submit and execute the jobs?
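
For concreteness, a rough sketch of what that could look like with the standard DIRAC data-management command-line tools; the LFN path and storage element names below are placeholders, not the ones actually configured for DC2:

# upload the tarball once and register it under an LFN (placeholder LFN/SE names)
dirac-dms-add-file /lsst/software/imsim.tar.gz ./imsim.tar.gz UKI-NORTHGRID-MAN-HEP-disk
# add a replica at a CC-IN2P3 storage element
dirac-dms-replicate-lfn /lsst/software/imsim.tar.gz IN2P3-CC-disk
# check where the replicas ended up
dirac-dms-lfn-replicas /lsst/software/imsim.tar.gz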

 

Cheers,

Fabio Hernandez

CNRS – IN2P3 computing centre · Lyon (France)     ·     e-mail: [log in to unmask]     ·     tel: +33 4 78 93 08 80



On 7 Jun 2019, at 16:52, PERRY James <[log in to unmask]> wrote:

 

Hi Dominique,

It's the ImSim code. I think someone recommended not to clone it from 
GitHub as this can result in GitHub blacklisting the worker nodes if 
they do this too many times, so I went with the tarball instead. We 
could try it using GitHub directly if you think it would be safe.

Cheers,
James


On 07/06/2019 15:49, Dominique Boutigny wrote:

Hi Alessandra and James,

I'm adding Fabio to the loop.
I don't think there is any problem with copying the tarball to CC-IN2P3.
By the way, what is this tarball? Is it the instance catalog or the
imSim code? If it is imSim, I thought we had decided to download it
from GitHub and build it locally, as that is very fast to do.
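
A minimal sketch of that route, assuming the public LSSTDESC repository; the checkout tag and the install step are assumptions and may differ from what the DC2 jobs actually use:

# clone imSim inside the job and build it locally (sketch, not the verified DC2 recipe)
git clone https://github.com/LSSTDESC/imSim.git
cd imSim
# git checkout <tag>       # pin a known-good version rather than master
pip install --user .       # assumed install step, run inside the LSST stack environment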

Cheers,

Dominique

On 07/06/2019 16:30, Alessandra Forti wrote:

We upgraded the system and changed the storage system configuration, so
there might be other factors at play, but this was the first thing that
jumped out, and until we reduce it we cannot know whether other things
are affecting the responsiveness of the storage.

That said, 1500 processes trying to access 1 file on 1 machine is not
healthy.

cheers
alessandra

On 07/06/2019 15:25, PERRY James wrote:

Hi Alessandra,

The site is CC. They didn't seem to want to mount the CVMFS repository,
but maybe we could convince them to.

I can download the file explicitly instead when required. Sorry, I
hadn't realised that this would put such a load on the system.

Thanks,
James


On 07/06/2019 15:16, Alessandra Forti wrote:

Hi James,

Is there a reason why they can't mount it? Is it LAPP or CC?

I would recommend that you don't use the software as an input file but
download it explicitly from the job if you cannot find it in CVMFS.
And/or the tarball should be copied to the French site storage closest
to their nodes.
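
Something along these lines in the job wrapper would do it; the CVMFS path is the gridpp repository and the LFN is a placeholder:

# fallback sketch: prefer the software from CVMFS, otherwise pull the tarball explicitly
if [ -d /cvmfs/gridpp.egi.eu/lsst ]; then
    SOFTWARE=/cvmfs/gridpp.egi.eu/lsst
else
    dirac-dms-get-file /lsst/software/imsim.tar.gz   # download into the working directory
    tar xzf imsim.tar.gz
    SOFTWARE=$PWD
fi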

The tarball on our storage was being accessed by 1500 processes
concurrently on the same machine earlier today, and I have already had
to replicate the file 3 times to try to spread the load onto others.
I'm surprised you didn't have timeouts.

cheers
alessandra

On 07/06/2019 14:59, PERRY James wrote:

Hi Alessandra,

We are mostly using CVMFS, but one of the compute nodes in France
doesn't mount our CVMFS repository, so we need the tarball for that one.
Unfortunately, because I can't predict when I submit a job whether it
will go to that node or not, all the jobs have the tarball listed as an
input file. I tried uploading copies to other storage elements as well
when I first put it on the grid, but at the time only Manchester was
working for me. I'm happy to discuss other solutions to this if it's
causing problems.
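
To illustrate the current setup, roughly what the job description looks like on the DIRAC side; the script name and LFN are placeholders rather than the actual DC2 ones:

# sketch of a job that lists the tarball as input data (placeholder names)
cat > imsim_job.jdl <<'EOF'
Executable   = "run_imsim.sh";
InputSandbox = {"run_imsim.sh"};
InputData    = {"LFN:/lsst/software/imsim.tar.gz"};
EOF
dirac-wms-job-submit imsim_job.jdl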

Thanks,
James


On 07/06/2019 14:52, Alessandra Forti wrote:

Hi James,

Can you let me know how you do software distribution? It seems you have
a single tarball on the Manchester storage that is creating a large
number of connections.

They might be among the causes of the current load we are experiencing.
Manchester isn't running anything at the moment, so either those are
ill-closed connections (could be) or the tarball you have on the
Manchester storage is the only source accessed by WNs at other sites in
the UK.

We always said that while the software was in development and LSST was
running at a smaller scale the storage was fine, but it wouldn't work if
too many jobs tried to access the same file on one storage. Have you
thought about using CVMFS or at the very least replicating the tarball
at other sites?

thanks

cheers
alessandra

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
James Perry                Room 2.41, Bayes Centre
Software Architect         The University of Edinburgh
EPCC                       47 Potterrow
Tel: +44 131 650 5173      Edinburgh, EH8 9BT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

 

 


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
James Perry                Room 2.41, Bayes Centre
Software Architect         The University of Edinburgh
EPCC                       47 Potterrow
Tel: +44 131 650 5173      Edinburgh, EH8 9BT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 



-- 
Respect is a rational process. \\//
For Ur-Fascism, disagreement is treason. (U. Eco)

 


Use REPLY-ALL to reply to list

To unsubscribe from the LSST-DESC-GRID list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=LSST-DESC-GRID&A=1


