Introduction

We are developing a cluster for local ATLAS computing using the TACC Rodeo system to boot virtual machines.  If you just want to use the system, see the next section and ignore the rest (which describes the virtual machine setup and is a bit out of date as of Sep 2015).

Transferring data from external sources

The Tier-3 worker nodes do not connect directly to any storage space.  Files on the /data disk, which is mounted by all the workstations and by utatlas.its.utexas.edu, can be read via the xrootd protocol (see below), so data must first be transferred to the tau workstations or to utatlas.its.utexas.edu.  Methods include:

  • Rucio download for Grid datasets
  • xrootd copy for files on CERN EOS/ATLAS Connect FaxBox/ATLAS FAX (Federated XrootD)
  • /wiki/spaces/utatlas/pages/50626812 for files on ATLAS Connect FaxBox, TACC, or CERN
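
For example, to pull a Grid dataset with Rucio or to copy a single file over xrootd, something like the following works from utatlas.its.utexas.edu or a tau workstation (a minimal sketch: the dataset and file names are placeholders, and it assumes the ATLAS software environment with the rucio and xrootd clients, plus a valid Grid proxy, is already set up):

Code Block
bash
# get a Grid proxy (needed for Rucio and most xrootd endpoints)
voms-proxy-init -voms atlas

# download a Grid dataset into the current directory (placeholder dataset name)
rucio download user.yourname:user.yourname.some.dataset

# copy a single file from CERN EOS via xrootd into /data (placeholder paths)
xrdcp root://eosatlas.cern.ch//eos/atlas/user/y/yourname/somefile.root /data/yourname/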


Getting started with Bosco

The Tier-3 uses utatlas.its.utexas.edu as its submission host - this is where the Condor scheduler lives.  However, you do not submit jobs by logging in there directly; jobs are submitted from the tau* workstations through Bosco.

Bosco is a job submission manager designed to manage job submissions across different resources.  It is needed to submit jobs from our workstations to the Tier-3.

Make sure you have an account on our local machine utatlas.its.utexas.edu, and that you have passwordless ssh set up to it from the tau* machines.

To do this, create an RSA key and copy your .ssh folder onto the tau machine using scp; one possible sequence is sketched below.
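
A minimal sketch, assuming the key pair is created on utatlas.its.utexas.edu and that ${USER} stands for your username (replace it as appropriate):

Code Block
bash
# on utatlas.its.utexas.edu: create an RSA key pair (empty passphrase for passwordless login)
ssh-keygen -t rsa
# authorize the new key for logins to utatlas itself
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# on a tau* workstation: copy the .ssh folder over, then check that no password is requested
scp -r ${USER}@utatlas.its.utexas.edu:.ssh ~/
chmod 700 ~/.ssh
ssh ${USER}@utatlas.its.utexas.edu hostname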

Then carry out the following instructions on any of the tau* workstations:

Code Block
bash
cd ~
# download the Bosco quickstart installer
curl -o bosco_quickstart.tar.gz ftp://ftp.cs.wisc.edu/condor/bosco/1.2/bosco_quickstart.tar.gz
# unpack and run the interactive quickstart script
tar xvzf ./bosco_quickstart.tar.gz
./bosco_quickstart
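
When the quickstart finishes, you can check that Bosco is running and that the remote cluster was registered (a sketch, assuming the default install location of ~/bosco):

Code Block
bash
# load the Bosco environment, make sure its local daemons are running, and list registered clusters
source ~/bosco/bosco_setenv
bosco_start
bosco_cluster --list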

...

  • The worker nodes do not mount any of our network disks, partly for simplicity and robustness, and partly for security reasons.  Because of this, your job must transfer files either using the Condor file transfer mechanism (recommended for code and output data) or using the xrootd door on utatlas.its.utexas.edu, which gives read access to /data through URLs of the form root://utatlas.its.utexas.edu://data/... (recommended for input datasets).  Although this may seem somewhat unfortunate, it is actually a benefit: any submitted job that runs properly on the Cloud Tier-3 can be flocked to other sites, which obviously do not mount our filesystem, without being rewritten (see Overflowing to ATLAS Connect below).  A complete example submission file is sketched after this list.
  • You must make your data world-readable for it to be visible through the xrootd door, because the server daemon runs as a very unprivileged user.  Run "chmod -R o+rX ." in the top-level directory above your data; this makes subdirectories world-listable and files world-readable.
  • You must submit jobs in the "grid" universe (again, to enable proper flocking).  In other words, 

    Code Block
    universe = grid
    grid_resource = batch condor ${USER}@utatlas.its.utexas.edu

    in your Condor submission file (replace ${USER} with your username).

  • The worker nodes have full ATLAS CVMFS.
  • One common problem is jobs going into the Held state with no logfiles or other explanation of what is going on.  Running condor_q -long <jobid> will show "Job held remotely for with no hold reason."  By far (>99.9%) the most common cause is that the submission file asks for a file to be transferred back through the file transfer mechanism, but the job never produces it, usually because the job failed (unable to read input data, a crash of the code, etc.).  Unfortunately you will not have the logfile, so the easiest way to debug this is to resubmit the job without the output file transfer specified in the submission script.  (This is a very unfortunate and nasty feature of Bosco.)
  • You can request multiple cores for your job by specifying

    Code Block
    +remote_SMPGranularity = 8
    +remote_NodeNumber = 8

    (for example, if you want 8 cores) in your submission script.
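
Putting these requirements together, a complete example submission file might look like the sketch below.  The executable, argument, and output filename are placeholders rather than part of the actual setup, and the multi-core lines can be dropped for single-core jobs (as above, replace ${USER} with your username):

Code Block
# route the job through Bosco to the Tier-3
universe = grid
grid_resource = batch condor ${USER}@utatlas.its.utexas.edu

# placeholder executable; it reads its input through the xrootd door
executable = run_analysis.sh
arguments = root://utatlas.its.utexas.edu://data/${USER}/input.root

# ship the code out and bring the output back with the Condor file transfer mechanism
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = output.root

# optional: request 8 cores
+remote_SMPGranularity = 8
+remote_NodeNumber = 8

output = job.out
error = job.err
log = job.log
queue

Submit it with condor_submit from the workstation where you ran bosco_quickstart.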

Overflowing to ATLAS Connect

...

Code Block
bash
ssh username@alamo.futuregrid.org

Then visit the list of instances to see which nodes are running, and simply

Code Block
bash
ssh root@10.XXX.X.XX

and you are now accessing a node!