...

Info
titleUnderstanding why some files start with a "."

In the code box above, you can see that the names start with a ".". When a filename starts with a ".", it conveys a special meaning to the operating system and command line. Specifically, the file is not displayed when you use the ls command unless you ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.
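If you would like to see this behavior for yourself without touching any real files, here is a small sketch. The file names are made up for illustration, and everything happens in a throwaway directory created by mktemp.

```bash
# Work in a throwaway directory so nothing in your home directory is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical file names: one normal file and one dot-file.
touch visible.txt .hidden.txt

plain=$(ls)         # the dot-file is left out
with_all=$(ls -a)   # the dot-file is listed, along with the . and .. entries

echo "ls shows:"
echo "$plain"
echo "ls -a shows:"
echo "$with_all"
```

The . and .. entries that appear with -a refer to the current directory and its parent directory.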

Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:

Code Block
languagebash
titleStandard output
ls              # Ignore everything that comes after the # mark. There is a problem on this wiki page, but anything after a # won't affect commands


Code Block
languagebash
titleStandard output plus hidden files
ls -a


Code Block
languagebash
titleStandard output plus hidden files in a single column
ls -a -1


Code Block
languagebash
titleStandard output plus hidden files in a single column with additional information
ls -a -l

Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually, when you have multiple options supplied in this manner, you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.

Code Block
languagebash
titleStandard output plus hidden files in a single column with additional information
ls -a -l

ls -al

While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.
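If you would rather have the computer do the comparison than compare by eye, the sketch below captures the output of the separated and combined forms and checks that they match. The directory and file names are hypothetical and live in a throwaway location.

```bash
# Throwaway directory with one normal file and one dot-file (made-up names).
demo_dir=$(mktemp -d)
touch "$demo_dir/sample.txt" "$demo_dir/.hidden.txt"

separate=$(ls -a -l "$demo_dir")   # options given one at a time
combined=$(ls -al "$demo_dir")     # same options behind a single dash

if [ "$separate" = "$combined" ]; then
    echo "identical output"
else
    echo "outputs differ"
fi
```

The same trick of saving output into variables and comparing them is handy any time you want to confirm that two commands really do the same thing.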

For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.

flag    association
-a      "all" files
-l      "long" listing of file information
-1      1 column
-h      human readable
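The -h option is the only one in the table not used above. It only changes anything when file sizes are being printed, for example together with -l. A minimal sketch, using a made-up 2048-byte file in a throwaway directory:

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create a hypothetical file that is exactly 2048 bytes long.
head -c 2048 /dev/zero > reads.txt

long_listing=$(ls -l reads.txt)     # size printed as a raw byte count
human_listing=$(ls -lh reads.txt)   # size printed with a unit suffix instead

echo "$long_listing"
echo "$human_listing"
```

With real sequencing files that are gigabytes in size, the human readable form is much easier to scan than a long string of digits.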


...

Warning

If you see anything besides "tacc:~$" as your prompt, get my attention and be ready to share your screen rather than continuing forward as something has gone wrong.



  • Setting up other shortcuts:

...

  • Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first text editor. nano is still powerful enough to accomplish whatever you are working on; more complex edits may just be more difficult. If you are already familiar with one of the other programs, you are welcome to continue using it. If you plan to edit files this way long term, it is worth spending the time to learn something other than nano after this class.
  • A former lab member suggested that VS Code may be the best current platform to combine much of this. While I trust his experience and suggestion, I don't have personal familiarity with it: https://code.visualstudio.com/docs/remote/ssh .
  • Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
  • Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.

...

  1. The most important thing to get used to is the convention of using . or _ or capitalizing the first letter of each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for Mac and Windows folder names when you are using visual interfaces, but on the command line a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted something a little easier to read and called it "Bio Informatics Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named Informatics, and a third named Team. This is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is correctly telling it to do what you want it to do".
  2. Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks, months, or years.
  3. Prefixing file/folder names with the international date format (YYYY-MM-DD) ensures that listing the contents prints them in date order, because ls sorts names alphabetically. This can be useful when doing the same or similar analysis on new samples as new data is generated.
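The naming points above can be tried out safely in a throwaway directory. This is only a sketch: the folder names and dates are invented for illustration.

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Unquoted spaces: bash hands mkdir three separate names, so three folders appear.
mkdir Bio Informatics Team
folder_count=$(ls -1 | wc -l | tr -d ' ')
echo "folders created: $folder_count"

# Names written without spaces stay as a single folder each.
mkdir BioInformaticsTeam bio_informatics_team

# Date-prefixed names (hypothetical runs, created out of order on purpose).
mkdir 2023-06-05_run1 2023-01-17_run2 2022-11-30_run3
date_sorted=$(ls -1d 20*)
echo "$date_sorted"
```

Note that the date-prefixed folders list oldest to newest by the date in the name, regardless of the order in which they were actually created.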

...

Expand
titlePeople always ask "but can you name new directories or files to have spaces in them?"

To answer the question, Yes, files/folders can have spaces. This is hidden away to keep you from accidentally thinking that this is a good idea. LET ME STRESS AGAIN this is a horrible habit to get into and will lead to unforced errors.

Instead, let's think about this from the perspective of encountering files or directories that you are working with but didn't create that have spaces in them, presumably because a colleague who didn't take this course sent you some data, and not because you personally thought it was a good idea. Spaces can be "escaped" like many other special characters. Imagine someone sent you a directory named "This is really annoying to use but I don't know it yet". To change into that directory you would have to type:

Code Block
languagebash
cd This\ is\ really\ annoying\ to\ use\ but\ I\ don\'t\ know\ it\ yet

Notice that the apostrophe also had to be escaped, which should help show you not to use other punctuation.

Tip

The tab key will automatically add the escape characters for you when there are no other matches to something else in the same path.
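Backslashes are not the only way to cope with a name like this. As a hedged sketch, wrapping the name in double quotes also makes bash treat it as a single argument, apostrophe included. The example builds the annoying directory in a throwaway location first so that it is safe to run.

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Double quotes keep the whole name together as one argument.
annoying="This is really annoying to use but I don't know it yet"
mkdir "$annoying"

cd "$annoying"
via_quotes=$(basename "$PWD")
cd ..

# The backslash-escaped form points at exactly the same directory.
cd This\ is\ really\ annoying\ to\ use\ but\ I\ don\'t\ know\ it\ yet
via_escapes=$(basename "$PWD")

echo "$via_quotes"
echo "$via_escapes"
```

Either way it is a lot of extra typing and care, which is the real argument for avoiding spaces in the first place.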



  • Understanding TACC

Now that we've been using stampede2 for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.

...

When you log into stampede2 using ssh, you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone who is logged into stampede2 (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node. The longer something takes to complete, or the more commands you send at once, the slower the head node will work for you and everybody else. Get enough people running large jobs on the head node all at once (say, a zoom class of summer school students) and stampede2 can actually crash, leaving nobody able to execute commands or even log in for minutes, hours, or perhaps even days if something goes really wrong. To avoid crashes, TACC monitors the head nodes and proactively stops things before they get too out of hand. If you guess wrong about whether something is safe to run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages will lead to revoked TACC access and emails where you have to explain to TACC and your PI what you were doing, how you are going to fix it, and how you will avoid it in the future.

Code Block
titleExample of how you learn you shouldn't have been on the head node
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of
CPU time on the login nodes.  Your current running process below has been
killed and must be submitted to the queues, for usage policy see
http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.

...

Note that this may not be a complete list, as it requires the name of the program or its description to contain the word "alignment". Looking through the results, you may notice some of the programs you already know and use for aligning 2 sequences to each other, such as blast. Try broadening your results a little by searching for "align" rather than "alignment" to see how important word choice is. When you compare the two sets of results, you will see that one of the new results is:

Code Block
bowtiebsmap: bowtiebsmap/2.3.2
	Memory-efficient short read (NGS) aligner

...

92
    BSMAP for Methylation

If you are sure you know the name of the program you need, this list may be sufficient, but if you don't know exactly what you need, the limited information available is probably not enough to make a good decision. To learn more about a particular program, try the following 2 commands:

...

For help with the ssh command please refer back to Windows10 or MacOS tutorials. If you log out and back in 1 more time, what do you notice is different?

...

Info
titleConda environments in the instructor's work

In my own work, I recently remarked to my PI that "I wish I had started using this 5 years ago", and was reminded that "it didn't exist 5 years ago, at least in its current super usable and popular format". It is entirely possible that future classes will be taught with only minimal references to the TACC module system, and this year's course will feature far fewer than any previous year.

You may be thinking that since the conda system can work on your personal computer, you could just work there for the duration of this class and ignore all the ssh commands and working remotely. This is strongly discouraged. While you would be able to use the same programs in both instances (in most cases), the tutorials are developed with the speed of the stampede2 system in mind. With some exceptions, they attempt to keep "waiting for something to finish" down to how long it takes someone to read through the next block of text in the tutorial. If you were to do these tutorials on your personal computer, the run times would significantly increase and it would be difficult to keep up with the rest of the class.

In Friday's lecture I will explain why installing and using conda on your local computer is still a good idea and how I am currently using it in conjunction with TACC.

...

If you have already activated your GVA-fastqc environment, the first line will not do anything, but if you have not, you will see your prompt has changed to now say (GVA-fastqc) on the far left of the line. As to the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:

...

More information about "channels" can be found here. By the end of this course you may find that the 'bioconda' channel is full of programs you want to use, and you may choose to permanently add it to your list of channels. The above command conda install fastqc, and others used in this course, would then work without the intermediate step of searching for the specific installation commands or finding which channel the program you want is in. Information about how to do this, as well as more detailed information on why it is bad practice to go around adding large numbers of channels, can be found here. Similarly, when we get to the read mapping tutorial, we will go over the conda-forge channel, which is also very helpful to have.

Tip
titlesuggested contents of .condarc file

channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: strict
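One possible way to get these contents into a file from the command line is sketched below. conda reads your per-user settings from ~/.condarc, but to avoid overwriting anything you already have, this sketch writes to a scratch file first (the scratch location is just an assumption for illustration) and leaves the final copy as a commented-out step.

```bash
# Hypothetical scratch location; the real per-user file is ~/.condarc.
condarc_preview=$(mktemp)

# Write the suggested contents exactly as shown in the box above.
cat > "$condarc_preview" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: strict
EOF

# Look over the result before using it.
cat "$condarc_preview"

# After checking the contents, and backing up any existing ~/.condarc,
# you could copy it into place with:
#   cp "$condarc_preview" ~/.condarc
```

The indentation matters here: each channel line starts with two spaces, then a dash and a space.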



For now, use the error message you saw above to try to install the fastqc program yourself.

Expand
titleIf you are having trouble finding the fastqc page on anaconda, the answer is here, as well as a description of the most likely problem you encountered.

https://anaconda.org/bioconda/fastqc

If you were unable to find this page, the most likely problem is this: you entered fastqc into the search box, recognized that the result with 650,000+ downloads was likely the program you wanted, and clicked the first bit of hyperlink you found, which took you to the bioconda page instead of to the fastqc program page. Personally, I think the entire box should be clickable to send you to the program page, but nobody has asked me.

...

If all goes well, the installation command should give you output similar to the following, with you answering "y" when asked whether you actually want to install the packages:

...

Genome Variant Analysis Course 2023 home.