Linux and Lonestar 5
- 1 Overview:
- 2 Objectives:
- 3 Tutorial:
- 3.1.1 How to log into lonestar
- 3.1.2 Use ls to check if particular file exists
- 3.1.3 Use mv to change your .profile file to a backup copy
- 3.1.4 Create new directories on your work partition named src and BioITeam
- 3.1.5 List of commands to copy
- 3.1.6 Copy the course provided .profile file and change its name and permissions
- 3.1.7 How to see hidden and not hidden files in linux
- 3.1.8 How to see details about hidden and not hidden files in linux
- 3.1.9 How to leave Lonestar by logging out
- 3.1.10 Go log back in to Lonestar
- 3.1.11 Creating a shortcut to the main Lonestar working directories
- 3.1.12 Print the contents of the .profile file to the screen
- 3.1.13 How to start the nano text editor
- 3.1.14 Redirecting STDOUT
- 3.1.15 Piping one command's output to another, and then redirecting STDOUT to a file
- 3.2 Diagram of Lonestar5 directories: What connects to what, how fast, and for how long.
- 3.3 Understanding "jobs" and compute nodes.
- 3.4 Running a job
- 3.5 Interrogating the launcher queue
- 3.6 Evaluating your first job submission
- 3.7 Moving beyond the preinstalled commands on TACC
Overview:
This portion of the class is devoted to making sure we all begin from the same starting point on lonestar. This tutorial is adapted from a previous version written for setup on the now-decommissioned lonestar4. Portions of this tutorial were adapted from previous versions which can be found here, here, here, here, here, and here. Collective thanks to all those who contributed to those works, which now appear in a single version. Anyone wishing to use this tutorial is welcome.
Objectives:
Log into lonestar5.
Change your lonestar profile to the course specific format.
Refresh understanding of basic linux commands with some course organization.
Review use of the nano text editor program, and become familiar with several other text editor programs.
Tutorial:
Logging into lonestar5
Start a new terminal window. On a Mac, click the magnifying glass on the right-hand side of the toolbar at the top of the screen and type "terminal". On Windows, connect through Cygwin. Log into lonestar using your account information.
This brings us to our first "code block". There will be 3 types of code blocks used throughout this class:
Visible
These are code blocks that you would have no idea what to type without help.
These will typically be associated with longer/more detailed text above the text box explaining things.
Hinted
These are code blocks that you can probably figure out what to type with a hint that goes beyond what the tutorial is requesting. Access the hint by clicking the triangle or hint hyperlink.
These will always contain an additional hidden code block in case you don't find the hint as clever as we did.
Hidden
These code blocks represent things that either there is a good chance you know how to do already, something too straightforward to warrant a hint, or are there to give you the answer if the hint doesn't help. Access the answer by clicking "expand source" on the right hand side of the code block.
Text inside of code blocks represents "right" answers, and should either be typed EXACTLY into the terminal window as it is, or copied and pasted, with one notable exception: things that exist within <> symbols represent something that you need to replace before sending it to the terminal. We try to put informative text within the brackets so you know what to replace it with. If you are ever unsure of what to replace the <> text with, just ask.
Using what we have just taught you about code blocks, log into lonestar. Since this is your first code box, it is probably worth expanding even if you know how to log into lonestar already.
How to log into lonestar
ssh <username>@ls5.tacc.utexas.edu
When prompted, enter your password, and answer "yes" to the security question.
Logging into remote computers
As a matter of internet safety, the terminal window knows you are entering a password and may not want your neighbor to see what it is. For this reason, even as you type to enter your password, nothing will be displayed on the screen. While backspace will work if you know you made a mistake, we often find it better to just hit enter and try again.
If you have never logged into lonestar from the computer you are currently using before, you will be issued a security warning. The same will be true if you log into any of the other TACC resources, or any other remote computer. If you ever see a security warning logging into somewhere that you use commonly you should answer no and try to figure out why you were warned. Otherwise type "yes" to bypass the security check.
Setting up your lonestar profile and other variables
There are many flavors of Linux/Unix shells. The default for TACC's Linux (and most other Linuxes) is bash (bourne again shell), which we will use throughout.
Whenever you login via an interactive shell as you did above, a well-known script is executed by the shell to establish your favorite environment settings. We've set up a common profile for you to start with that will help you know where you are in the file system and make it easier to access some of our shared resources. If you already have a profile set up on lonestar that you like, we want to make sure that we don't destroy it but it will be important to make sure that we change it temporarily. Use the ls command to check if you have a profile already set up in your home directory.
Use ls to check if particular file exists
cdh
ls .profile
ls .bashrc
If you already have a .profile or .bashrc file, use the mv command to change the name to something descriptive (for example ".profile_pre_bdib_backup"). Otherwise, continue on to creating new files.
Use mv to change your .profile file to a backup copy
mv .profile .profile_pre_bdib_backup
mv .bashrc .bashrc_pre_bdib_backup
The BioITeam has several useful programs, libraries, and scripts globally available on the head node, but these useful things are not available from any of the compute or interactive nodes. We will explain more about this soon, but for the time being just know that there are things that you only sometimes have access to currently, and we want you to have access to them all the time, so we have to copy some things into specific locations to make sure everyone is working with the same setup throughout the course. After you have finished taking the course you may find additional useful things in the BioITeam locations, and the things that you copy may get updated from time to time. On the last day of the course we'll go through how to sync the things you have copied and how to access additional community tools that we won't use in this course, so if you like foreshadowing, you are welcome.
We will explain more about what the different areas of TACC are shortly, but for now we are going to execute some commands that will make it a bit more useful. First we will make 2 new directories on the $WORK partition to serve as locations to copy things from the BioITeam. Using the mkdir command, create a new directory named src and a directory inside of the src directory named BioITeam.
Create new directories on your work partition named src and BioITeam
cd $WORK
mkdir src
mkdir src/BioITeam
You may have noticed that we executed the mkdir commands sequentially. This is done to make sure that the src directory exists before trying to put a new directory inside of it. This leads us to an interesting and important thing to consider: how should we name files and folders? In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. The most important thing to get used to is the convention of using . or _ in names rather than spaces, and limiting your use of any other punctuation. Spaces are great for Mac and Windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: one named Bio, another named I, and a third named Team. This is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is telling it to do what you want it to do".
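To make the point about spaces concrete, here is a minimal sketch that can be run anywhere. It works inside a throwaway directory made with mktemp, and the directory names are only the illustrations used in the paragraph above.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Unquoted: bash splits the arguments on spaces and makes three directories.
mkdir Bio I Team

# Quoted: one directory whose name contains two spaces.
mkdir "Bio I Team"

# Underscores instead of spaces: one directory, no quoting needed later.
mkdir Bio_I_Team

# List one entry per line to see all five directories.
ls -1
```

Every later command that touches "Bio I Team" has to quote the name again, which is why the underscore version is usually less trouble.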
Now that we have created the directories to copy BioITeam materials into, let's copy the bin, python2.7, lib, local, perl5, and breseq directories from the /corral-repl/utexas/BioITeam directory to your $WORK/src/BioITeam directory. Remember that you want to copy them recursively so you get all the contents of those folders as well.
List of commands to copy
cd $WORK/src/BioITeam
cp -r /corral-repl/utexas/BioITeam/bin .
cp -r /corral-repl/utexas/BioITeam/python2.7 .
cp -r /corral-repl/utexas/BioITeam/lib .
cp -r /corral-repl/utexas/BioITeam/local .
cp -r /corral-repl/utexas/BioITeam/perl5 .
cp -r /corral-repl/utexas/BioITeam/breseq .
Some of these copy commands may take a few minutes to complete (the bin directory specifically) and you may see some permissions errors such as the following. This is expected and not concerning.
cp: cannot open `/corral-repl/utexas/BioITeam/bin/smrtanalysis-2.0.1/analysis/lib/python2.7/networkx-1.1-py2.7.egg/networkx/drawing/nx_pydot.pyc' for reading: Permission denied
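The repeated cp -r commands above can also be written as a loop. The following is only a sketch of that idea: it uses temporary stand-in directories (src_root and dest_root are made-up names) and three small fake folders, so it runs anywhere without touching /corral-repl or $WORK.

```shell
# Stand-in directories so the sketch can run anywhere. On Lonestar the source
# would be /corral-repl/utexas/BioITeam and the destination $WORK/src/BioITeam.
src_root=$(mktemp -d)
dest_root=$(mktemp -d)

# Create small fake source directories, each holding one file.
for name in bin lib local; do
    mkdir -p "$src_root/$name"
    echo "placeholder" > "$src_root/$name/example.txt"
done

# One recursive copy per directory, written as a loop.
cd "$dest_root"
for name in bin lib local; do
    cp -r "$src_root/$name" .
done

# Show everything that arrived, including the files inside each directory.
ls -R "$dest_root"
```

Without -r, cp would refuse to copy a directory at all, which is why the recursive switch matters here.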
When the last of the above commands has finished, copy our predefined GVA2016.bashrc and GVA2016.profile files from the /corral-repl/utexas/BioITeam/scripts/ folder to your $HOME folder as .bashrc and .profile, then use the chmod command to make them readable, writable, and executable by you only.
Copy the course provided .profile file and change its name and permissions
cd $HOME
cp /corral-repl/utexas/BioITeam/scripts/GVA2016.bashrc .bashrc
cp /corral-repl/utexas/BioITeam/scripts/GVA2016.profile .profile
chmod 700 .bashrc
chmod 700 .profile
The chmod 700 <FILE> command marks the file as readable/writable/executable only by you. The .bashrc script file will not be executed unless it has these permissions settings.
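For anyone curious what the digits in 700 mean, here is a small sketch on a temporary file (demo_file is a made-up name, not one of the course files). Each digit is the sum of 4 (read), 2 (write), and 1 (execute), given in the order user, group, everyone else.

```shell
# Make a temporary file and give it a typical starting permission set.
demo_file=$(mktemp)
chmod 644 "$demo_file"

# The first 10 characters of "ls -l" are the file type and permission bits.
before=$(ls -l "$demo_file" | cut -c1-10)

# 7 = 4+2+1 (read, write, execute) for the user; 0 for group; 0 for others.
chmod 700 "$demo_file"
after=$(ls -l "$demo_file" | cut -c1-10)

echo "before: $before"   # -rw-r--r--
echo "after:  $after"    # -rwx------

# Clean up the temporary file.
rm "$demo_file"
```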
Notice that when you do a normal ls to list the contents of your home directory, these files don't appear. That's because they are hidden "dot files": files whose names begin with a period. To see these hidden files use the -a (all) switch for ls:
How to see hidden and not hidden files in linux
ls -a
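The difference between ls and ls -a is easy to see in a throwaway directory. This sketch uses two made-up file names, one of which starts with a period.

```shell
# Work in a throwaway directory with one normal file and one dot file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch visible.txt .hidden_settings

# Plain ls skips every name that starts with a period.
plain_listing=$(ls)

# ls -a shows everything, including . (this directory) and .. (its parent).
full_listing=$(ls -a)

echo "ls:"
echo "$plain_listing"
echo "ls -a:"
echo "$full_listing"
```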
To see even more detail, including file permissions, add the -l (long listing) switch:
How to see details about hidden and not hidden files in linux
ls -la
Since .bashrc is executed when you log in, to ensure it is set up properly you should first log out:
How to leave Lonestar by logging out
exit
then log back in:
Go log back in to Lonestar
ssh <username>@ls5.tacc.utexas.edu
If everything is working correctly you should now see a prompt like this: tacc:~$
In order to make navigating to the different file systems on lonestar a little easier ($SCRATCH and $WORK), you can set up some shortcuts with these commands that create folders that "link" to those locations. Run these commands when logged into Lonestar with a terminal, from your home directory.
Creating a shortcut to the main Lonestar working directories
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam
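If symbolic links are new to you, the following sketch shows how one behaves. It uses two temporary directories as stand-ins (demo_home and far_away are made-up names) rather than your real $HOME, $WORK, or $SCRATCH.

```shell
# Two throwaway directories: one plays the role of home, one a distant location.
demo_home=$(mktemp -d)
far_away=$(mktemp -d)
echo "some data" > "$far_away/notes.txt"

# Create a link named "shortcut" in the home stand-in pointing at the other directory.
cd "$demo_home"
ln -s "$far_away" shortcut

# ls -l shows where the link points; files are reachable through it.
ls -l shortcut
cat shortcut/notes.txt

# readlink prints the stored target of the link.
link_target=$(readlink shortcut)

# Removing the link does not remove the directory it pointed to.
rm shortcut
ls "$far_away"
```

The link is only a pointer, so deleting it is safe, while deleting files through it removes the real files.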
Understanding what your .bashrc file actually does.
Editing files
There are a number of options for editing files at TACC. These fall into three categories:
Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano may be the best choice as a first local text editor.
Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.
We'll go over nano together in class, but you may find these other options more useful for your day-to-day work so feel free to go over these sections in your free time to familiarize yourself with their workings to see if one is better for you.
As we will be using nano throughout the class, it is a good idea to review some of the basics. nano is a very simple editor available on most Linux systems. If you are able to use ssh, you can use nano. To invoke it, just type:
How to start the nano text editor
nano
You'll see a short menu of operations at the bottom of the terminal window. The most important are:
ctl-o - write out the file
ctl-x - exit nano
You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:
ctl-a - go to start of line
ctl-e - go to end of line
Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in our commands files, as you will see.
Stringing commands together and controlling their output
In a linux shell, it is often useful to take the output of one command and save it to a new file rather than having it print to the screen, or to hand it to another command. Linux uses a familiar metaphor for this: "pipes". The linux operating system expects some "standard input pipe" and gives output back through a "standard output pipe". These are called "stdin" and "stdout" in linux. There's also a special "stderr" for errors; we'll ignore that for now. Usually, your shell is filling the operating system's stdin with stuff you type: the commands with options. The shell passes responses back from those commands to stdout, which the shell usually dumps to your screen. The ability to switch stdin and stdout around is one of the key reasons linux has existed for decades and beat out many other operating systems. Let's start making use of this. Change to the scratch directory, make a new folder called "piping", and put a list of the full contents of the $BI folder into a new file called whatsHere.
Redirecting STDOUT
cds
mkdir piping
ls -1 $BI > whatsHere
cat whatsHere
When you execute the ls -1 $BI > whatsHere command, you should have noticed nothing happening on the screen, and when you cat the whatsHere file, you should see the output you would have expected from the ls -1 $BI command. Often it is useful to chain commands together, using the output of the first command as the input of the second command. Commands are chained together using the "|" character (shift + \, above the return key). Use a pipe and redirection to put the first 2 lines of the $BI directory contents into the whatsHere file.
Piping one command's output to another, and then redirecting STDOUT to a file
ls -1 $BI | head -2 > whatsHere
cat whatsHere
Again, you should see your answer only showing up after the cat command. Note that by using a single > you are overwriting the existing contents, and that there is no warning that this is happening. Beware of this in the future, as linux doesn't have an "undo" feature. We will make use of the redirect output (stdout) character (>) and the "pass output along as input" character (|) throughout the course. Not all shells are equal: the bash shell lets you redirect stdout with either > or 1>; stderr can be redirected with 2>; you can redirect both stdout and stderr using &>. If these don't work, use Google to try to figure it out. The website Stack Overflow is a usually trustworthy and well annotated site for OS and shell help.
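The redirection and pipe operators described above can be tried safely in a throwaway directory. This is a minimal sketch with made-up file names; it also shows >>, which appends rather than overwrites.

```shell
# Work in a throwaway directory so no real files are overwritten.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# > creates the file, or silently replaces whatever was already in it.
echo "first" > out.txt
echo "second" > out.txt

# >> appends to the end of the file instead of overwriting it.
echo "third" >> out.txt

# A pipe feeds one command's stdout into the next command's stdin.
printf 'a\nb\nc\n' | head -n 2 > first_two.txt

# 2> captures stderr: the error message from ls lands in err.txt, not on screen.
# "|| true" lets the sketch continue even though ls reports a failure.
ls no_such_file 2> err.txt || true

cat out.txt        # second, then third
cat first_two.txt  # a, then b
```

If silent overwriting worries you, bash also has a noclobber option (set -o noclobber) that makes > refuse to replace an existing file.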
Understanding TACC
Now that we've been using lonestar for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.
Diagram of Lonestar5 directories: What connects to what, how fast, and for how long.
Lonestar is a collection of 1,252 computers, each with 24 cores, connected to three file servers, each with unique characteristics. You need to understand the file servers to know how to use them effectively.
| | $HOME | $WORK | $SCRATCH |
|---|---|---|---|
| Purged? | No | No | Files can be purged if not accessed for 10 days. |
| Backed Up? | Yes | No | No |
| Capacity | 5GB | 1TB | Basically infinite. |
| Commands to Access | cdh or cd $HOME/ | cdw or cd $WORK/ | cds or cd $SCRATCH/ |
| Purpose | Store Executables | Store Files and Programs | Run Jobs |
Executables that aren't available on TACC through the "module" command should be stored in $HOME.
If you plan to be using a set of files frequently or would like to save the results of a job, they should be stored in $WORK.
If you're going to run a job, it's a good idea to keep your input files in a directory in $WORK and copy them to a directory in $SCRATCH where you plan to run your job.
Example command for copying data from a $WORK directory to $SCRATCH
cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/
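Here is a sketch of that staging step that can be run anywhere. The variables work_dir and scratch_dir are temporary stand-ins for $WORK and $SCRATCH, and the file names are invented for illustration.

```shell
# Temporary stand-ins for $WORK and $SCRATCH so nothing real is touched.
work_dir=$(mktemp -d)
scratch_dir=$(mktemp -d)

# Fake input data: two fastq files and one file that should stay behind.
mkdir -p "$work_dir/my_fastq_data"
echo "@read1" > "$work_dir/my_fastq_data/sample1.fastq"
echo "@read2" > "$work_dir/my_fastq_data/sample2.fastq"
echo "notes" > "$work_dir/my_fastq_data/README.txt"

# The destination directory has to exist before copying into it.
mkdir -p "$scratch_dir/my_project"

# The *fastq pattern matches only names ending in "fastq".
cp "$work_dir"/my_fastq_data/*fastq "$scratch_dir/my_project/"

ls -1 "$scratch_dir/my_project"
```

The originals stay in the $WORK stand-in, which matters on the real system because $SCRATCH can be purged.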
Understanding "jobs" and compute nodes.
When you log into lonestar using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone that is logged into lonestar (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, the more it will slow down you and everybody else. Get enough people running large jobs on the head node all at once (say a classroom full of Big Data in Biology summer school students) and lonestar can actually crash, leaving nobody able to execute commands or even log in for minutes -> hours -> even days if something goes really wrong. To try to avoid crashes, TACC monitors things and proactively stops them before they get too out of hand. If you guess wrong about whether something should be run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages can lead to revoked TACC access and emails where you have to explain to TACC and your PI what you are doing, how you are going to fix it, and how you will avoid it in the future.
Example of how you learn you shouldn't have been on the head node
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of
CPU time on the login nodes. Your current running process below has been
killed and must be submitted to the queues, for usage policy see
http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.
So you may be asking yourself what the point of using lonestar is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are 1,252 compute nodes, each of which can only be accessed by a single person at a time, for a specified amount of time. These compute nodes are divided into different queues called: normal, development, largemem, etc. Access to nodes (regardless of what queue they are in) is controlled by a "Queue Manager" program. You can personify the Queue Manager program as: Heimdall in Thor, a more polite version of Gandalf in Lord of the Rings when dealing with the balrog, the troll from the Billy Goats Gruff tale, or any other "gatekeeper" type. Regardless of how nerdy your personification choice is, the Queue Manager has an interesting caveat: you can only interact with it using the sbatch command. "sbatch <filename.slurm>" tells the Queue Manager to run a set job based on information in filename.slurm (i.e. how many nodes you need, how long you need them for, how to charge your allocation, etc). The Queue Manager doesn't care WHAT you are running, only HOW to find what you are running (which is specified by a "setenv CONTROL_FILE commands" line in your filename.slurm file). The WHAT is then handled by the file "commands", which contains what you would normally type into the command line to make things happen.
Further sbatch reading
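To give a feel for what a .slurm file contains, here is a hand-written sketch using standard sbatch options. It is not the file that launcher_creator.py generates and it leaves out the launcher's control-file line. The job name, allocation name, queue, and time limit are all placeholders to replace with your own. Lines starting with #SBATCH are read by sbatch but treated as ordinary comments by bash.

```shell
#!/bin/bash
# Job name shown in the queue (placeholder).
#SBATCH -J my_first_job
# Number of nodes requested.
#SBATCH -N 1
# Total number of tasks.
#SBATCH -n 1
# Queue to submit to.
#SBATCH -p normal
# Wall-clock time limit as hh:mm:ss.
#SBATCH -t 00:10:00
# Allocation to charge (placeholder name).
#SBATCH -A my_allocation
# File that collects stdout; %j is replaced by the job ID.
#SBATCH -o my_first_job.o%j

# Everything below is the WHAT: ordinary commands run on the compute node.
job_host=$(uname -n)
echo "Job running on $job_host"
```

Saved as a file such as filename.slurm, a script like this would be handed to the Queue Manager with sbatch rather than run directly on the head node.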
To make things easier on all of us, there is a script called launcher_creator.py that you can use to automatically generate a .slurm file. This can all be summarized in the following figure:
Using launcher_creator.py
We have created a Python script called launcher_creator.py that makes creating a .slurm file a breeze. Before learning to work with interactive compute nodes during the class, we will show you how you will most often do your analysis. Run the launcher_creator.py script with the -h option to show the help message so we can see what other options the script takes: