Welcome to the class! This is the home page for the Introduction to Biocomputing class. From this page, you can link to all of the content needed to work through the course.

We will meet in FNT 1.104 from Tuesday, May 27 to Friday, May 30, from 1:30 to 4:30 pm each day.

You are strongly encouraged to attend in person, but if you would like to attend virtually, use this Zoom link:

https://utexas.zoom.us/j/96012546979?pwd=Xx4Je9KHZVKSFJFhAQwuwWGihL1kyS.1

Zoom Instructions:

Please make sure your Zoom client is up to date and that you have a Zoom account. This is required to join a UT-sponsored Zoom session. See this link for more details about Zoom requirements: https://zoom.its.utexas.edu/home

Other setup:

Please make sure you have an SSH client installed on your computer. All Macs come with Terminal, so no installation is required. On Windows laptops, install PuTTY and WinSCP.

Course Overview

This course is designed to give you a hands-on introduction to data manipulation and analysis. It will consist of lectures and guided tutorials.

This course has the following objectives: 

  1. To teach you the Unix (Bash) environment and how to work effectively at the command line. 

  2. To teach you about various Unix tools that are available for manipulating files at the command line.

  3. To familiarize you with R and RStudio and how to perform basic data analysis tasks.

  4. To use the above skills in an integrated manner to take data from raw files and produce clean visualizations in R.

Your Instructor

Name: Matt Bramble

Affiliation: CBRS, bioinformatics group

Expertise: Unix, R, Python, epigenetics

How to contact: matthew.bramble@austin.utexas.edu, or come to FNT 1.207C (room to be confirmed)

Day 1, 2: Introduction to Unix

Login to the POD

If you are using a terminal, open the terminal, and type:

ssh username@ls6.tacc.utexas.edu
(replace username with your TACC username)

If you are using PuTTY, open it and enter the following:

hostname:  ls6.tacc.utexas.edu

When prompted, enter your username and password.

NEXT:

Open the following link in a new window or tab. Be sure to bookmark this introduction page. We will be returning.

Go to Unix/Linux wiki

If your Terminal has a dark background, the default shell colors can be hard to read. Execute this line to display directory names in yellow.

export LS_COLORS=$LS_COLORS:'di=01;33:'

We'll see later how to set this environment variable in your login script (~/.profile) so that it gets executed every time you log in to this server.

For now, just copy the appropriate line above, paste it into your Terminal window (after logging on), then press Enter.
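As a preview, here is a minimal sketch of how the setting could be made persistent. It appends the line to a scratch file (the made-up variable `profile_file`) rather than to your real `~/.profile`, so it is safe to run as-is. Whether `~/.profile` is the right file on this server is an assumption to confirm in class.

```bash
# Scratch file standing in for ~/.profile (hypothetical; swap in the real path later).
profile_file=$(mktemp)

# Single quotes keep $LS_COLORS unexpanded, so it is evaluated at each login.
echo 'export LS_COLORS=$LS_COLORS:"di=01;33:"' >> "$profile_file"

# Load the file into the current shell, roughly as a login would.
. "$profile_file"

# The directory color entry is now part of the variable.
echo "$LS_COLORS"
```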

Logging in via RStudio web

If you're attending remotely or do not have access to the UT VPN, you can use the Terminal functionality in the RStudio web application.

To access the Terminal built into RStudio Server:

  • Click on this link: 

  • Enter your studentNN account name and our password, then click the Sign In button

  • In the RStudio web application, select the Terminal tab above the type-in area

You should now see a command line in the RStudio type-in area.

In the RStudio Terminal, if the default color for directories is difficult to see against the white background, execute this line to display directory names in blue.

export LS_COLORS=$LS_COLORS:'di=1;34:fi=01:ln=01;36:'

We'll see later how to set this environment variable in your login script (~/.profile) so that it gets executed every time you log in to this server.

For now, just copy the appropriate line above, paste it into your Terminal window (after logging on), then press Enter.

# UNIX Workshop

Learn the basics of using UNIX from the command line. Introductory topics include the filesystem, the shell, permissions, and text files. The course will touch on manipulating text files using standard UNIX utilities, how to string utilities together, and how to output the results to files. The goal of the course is to develop some basic comfort at the command line, get a sense of what's possible, and learn how to find help.

## Experience

### Who's Used A Command Line Before?

  • DOS?

  • Linux?

### Who has any programming experience?

## What I Hope This Class Accomplishes

## What's The Command Line?

### You Literally Enter Commands Sequentially At A Prompt

### What's Wrong With Using GUIs? Why Don't People Write Their Code So It Uses A Nice Interface?

  • Programming involves writing text files anyway

  • Command line lets you be very precise and flexible at the same time

  • We'll see a little later how the Unix philosophy of lots of small tools gives you flexibility

  • After the initial learning curve (and it's nontrivial), the command line can actually be easier/faster

  • You can work on remote machines from your laptop easily. We'll do that in this class.

## Get Example Files

The pieces of each command line will be explained later; for now, use this as a reference:

  • command

  • options

  • combined options

*These should look like weird incantations at first; they will be explained later.*

### Get/Untar Files

  • `cp ~benni/Classes/IntroUnix/IntroUnixFiles.tar.gz .`

  • `tar -zxvf IntroUnixFiles.tar.gz`

  • `ls`

  • `rm IntroUnixFiles.tar.gz`

*WTF just happened? Hopefully it will be clear later.*

(Maybe Explain Tar For Grouping Files, gz As Compression?)
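To make the tar/gz distinction concrete, here is a small sketch that builds an archive and unpacks it again in a throwaway directory. The file and directory names are made up for illustration; only standard `tar` options are used.

```bash
# Work in a throwaway directory so nothing in your home directory is touched.
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (made-up names) inside a directory.
mkdir demo_files
echo "first file" > demo_files/a.txt
echo "second file" > demo_files/b.txt

# tar groups the directory into one archive (-c create, -f archive name);
# -z additionally compresses it with gzip.
tar -czf demo_files.tar.gz demo_files

# Remove the originals, then list the archive without extracting it (-t).
rm -r demo_files
tar -tzf demo_files.tar.gz

# Extract (-x); -v prints each name as it is unpacked.
tar -zxvf demo_files.tar.gz
restored=$(cat demo_files/a.txt)
echo "$restored"

# Clean up.
cd "$start_dir"
rm -r "$workdir"
```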

## Why Unix? What's Linux?

  • History of Unix slideshow

## Getting Around

  • What is the command line?

  • Using Terminal in Mac (and whatever Windows equivalent is)

  • Show how to turn on Meta/option key in Preferences -> Settings -> Keyboard

  • Explain meta key, and Ctrl-x, C-x, ^x

## Basics Of Files And Directories

### Filesystem

#### Files and directories

  • `ls`

  • `mkdir short_reads`

  • `cd`

  • `pwd` where are we?

  • Go up a level using `..`. Explain `.`.

  • `pwd` again

  • `rm` can't remember how to remove directory, use `-r` (Google "unix remove directory")

  • Switch into a populated directory (show off **tab completion**)

  • Compare `ls`, `ls -l`, `ls -lh`

  • Show that the Mac is a Unix machine, run commands in terminal, show on Desktop
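The directory commands above can be tried safely in a scratch directory. This is only a sketch; apart from `short_reads`, the variable names are made up for illustration.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

mkdir short_reads              # create a directory
cd short_reads                 # move into it
inside=$(basename "$(pwd)")    # pwd prints the full path; keep just the last part
echo "$inside"

cd ..                          # .. is the parent directory
cd .                           # . is the current directory, so this goes nowhere
ls -l                          # long listing: permissions, owner, size, date

rm -r short_reads              # -r is needed to remove a directory
if [ -d short_reads ]; then still_there=yes; else still_there=no; fi
echo "short_reads still there? $still_there"

cd "$start_dir"
rm -r "$workdir"
```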

Copy and move

  • `cp` file

  • `mv` Explain how it is both move and rename.

  • Globbing

+ `*` for arbitrary matches

+ Example of copying a number of files using a glob, or remove a lot of files using glob

+ `mv *.fastq short_reads/` moves every `.fastq` file into the `short_reads` directory

+ `tree`

  • *Probably don't bother with this* `scp` to copy between servers, stands for "secure `cp`" ("secure copy")

+ Back to `gsafcbig01`

+ `~/Classes/IntroUnix`

+ `N_vespilloides_trans.fa`

+ `scp` to `IntroUnix` directory on Stampede
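A sketch of the copy, move, and glob steps, using empty stand-in files with made-up names rather than the class data:

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# Three empty stand-in files plus a notes file.
touch sample1.fastq sample2.fastq sample3.fastq notes.txt
mkdir short_reads

cp notes.txt notes_backup.txt        # cp leaves the original in place
mv notes_backup.txt old_notes.txt    # mv with a new name is a rename

# The shell expands *.fastq to every matching file name before mv runs.
mv *.fastq short_reads/

moved=$(ls short_reads | wc -l | tr -d ' ')
echo "$moved files moved"
ls

cd "$start_dir"
rm -r "$workdir"
```

Because the shell, not `mv`, expands the glob, running `echo *.fastq` first is a cheap way to preview what a pattern will match before moving or removing anything.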

Useful shortcuts

  • Tab autocompletion

  • `history`

  • Up/down arrows

  • `cd -`

  • Basic EMACS commands for getting around lines (don't mention EMACS)

+ `Ctrl-a`, `Ctrl-e`, `Meta-f`, `Meta-b`, `Ctrl-k`

  • `Ctrl-C`

Mention the shell

  • What's a shell?

  • Bash is actually a useful programming language

  • Bash has a print statement, it's called `echo`. Example: `echo walrus`
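A few lines are enough to show that Bash behaves like a programming language. This is only an illustrative sketch; the variable names and the list of sounds are made up.

```bash
echo walrus                   # prints: walrus

animal="walrus"               # no spaces around = in an assignment
greeting="hello, $animal"     # double quotes expand variables
echo "$greeting"

# A loop: run echo once per word in the list.
for sound in grunt bellow whistle; do
    echo "$animal says $sound"
done
```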

### Concept of a server, SSH ("secure shell")

  • SSH into Stampede

  • I logged into Stampede without a password because I have "ssh keys" set up.

  • ssh into, say, `gsafcbig01` from Stampede, where I need a password

  • `exit`, back to Stampede.

  • Switch into home directory

  • Explain `~`

  • Show `ls -a`, explain dot files. Point out `.profile_user` or `.bashrc`

  • `pwd`

## How To Get Help

  • I, personally, don't have commands memorized

  • `man` pages, `man ls`

  • `-h`, `--help` (`python --help` as an example)

  • How to Google for answers, example: remove directory

## Manipulating Text Files

  • Viewing a text file

  • `cat` - concatenate files, but it's also good for printing contents of file.

  • `cat jabberwocky.txt`

  • `cat N_vespilloides_trans.fa`

  • `head`, `tail`

  • `cat` on that FASTA file is too long! And it's only a small part of a transcriptome!

  • `head N_vespilloides_trans.fa`

  • Line wrapping. To make what's going on a little clearer: `head jabberwocky.txt` Note some lines are empty.

  • `less` (show on `mobydick.txt`)

  • to quit out of less, use `q`

  • `<space>` to scroll

  • `b` to scroll backwards

  • `shift-G` to jump to end

  • `g` to go to beginning

  • Google `linux less` for options

  • Show off `man`?

  • Google `linux less show line number`

  • `nano`

  • Go into Help

  • explain ^ for control, M for "meta"

  • mention Emacs and vi

  • What's a text file?

  • Unix is based around newline-delimited text files

  • But how are characters specified in binary?

  • `head -3 jabberwocky.txt`

  • `od -t u1 jabberwocky.txt`

  • What's a "string" vs an "integer" vs a "float"

  • Tabs, CRs, invisible characters

  • `\t`, `\n`, `\r`? Unix (computers, actually) treats tabs and newlines as characters, how they're rendered on-screen is special.

  • `od -c jabberwocky.txt`
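To see that tabs and newlines really are ordinary characters, the same `od` options can be run on a four-character input instead of a whole file. This sketch assumes only `printf`, `od`, and `xargs`, which are standard on Linux and macOS.

```bash
# printf turns the escapes \t and \n into real tab and newline characters.
printf 'A\tB\n' | od -c       # show each character, with escapes spelled out

# The same bytes as unsigned decimal numbers; -An drops the offset column
# and xargs squeezes the padding down to single spaces.
codes=$(printf 'A\tB\n' | od -An -t u1 | xargs)
echo "$codes"                 # 65 9 66 10
```

Here 65 and 66 are the codes for `A` and `B`, 9 is the tab, and 10 is the newline.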

## Streams And Coreutils

### Unix Pipes And Coreutils

  • Coming from other OSes

+ Large applications that try to give you options for anything you might want to do, all inside the application.

  • Zen of Unix is the reverse

+ Not a few large, general programs

+ The Unix way is to make many tiny, specialized programs that you can combine in creative ways.

Coming from a GUI, this way of doing things may seem alien at first. But once it becomes natural, it's a very nice and powerful way of dealing with large data quickly and reproducibly.

### ET Vocalizing?

### Walrus Sounds Example

Find out the distinct walruses

  • `cut -f 1 walrus_sounds.tsv`

  • Let's save that long list to a file: `cut -f 1 walrus_sounds.tsv > walrus_sounds_cutf1.tsv`

  • The same walruses recur, but the names don't occur twice in a row. Explain how `uniq` works, why things need to be sorted.

  • `sort walrus_sounds_cutf1.tsv > walrus_sounds_cutf1_sort.tsv`

  • `uniq walrus_sounds_cutf1_sort.tsv`

Woohoo. We got the distinct walruses, but this is a little clumsy/tedious, and programmers are lazy.
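The reason for the sorting step can be seen on a five-line file. The names below are made up and stand in for column 1 of the class file; the intermediate files mirror the steps above.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# Made-up names; equal names never sit on adjacent lines.
printf 'Wally\nTusk\nWally\nTusk\nWally\n' > names.txt

# uniq only collapses runs of identical ADJACENT lines, so nothing is removed.
uniq names.txt > uniq_only.txt

# After sorting, equal names are adjacent and uniq collapses them.
sort names.txt > names_sorted.txt
uniq names_sorted.txt > sort_then_uniq.txt

# wc -l counts lines; $(( )) just tidies the number.
unsorted_count=$(( $(wc -l < uniq_only.txt) ))
sorted_count=$(( $(wc -l < sort_then_uniq.txt) ))
echo "uniq alone: $unsorted_count lines; sort then uniq: $sorted_count lines"

cd "$start_dir"
rm -r "$workdir"
```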

### Standard Streams & Pipes

  • Everything in UNIX is a file

  • `STDIN`, `STDOUT`, `STDERR`

  • Redirection: `>`, `>>` and maybe `<`

  • Stderr `2>`, `2>&1`

  • Pipes
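The redirection operators are easiest to understand by watching where each stream ends up. The sketch below uses made-up file names in a scratch directory and relies on `ls` complaining about a file that does not exist.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

echo "first line" > out.txt       # > creates or overwrites the file
echo "second line" >> out.txt     # >> appends to it
wc -l < out.txt                   # < feeds the file to the command's STDIN

# ls prints the listing on STDOUT and the complaint about the missing
# file on STDERR; 2> sends only the error stream to a file.
ls out.txt no_such_file > listing.txt 2> errors.txt

# 2>&1 points STDERR at wherever STDOUT is going, so both land in one file.
ls out.txt no_such_file > both.txt 2>&1

line_count=$(( $(wc -l < out.txt) ))
listing=$(cat listing.txt)
if [ -s errors.txt ]; then errors_captured=yes; else errors_captured=no; fi
both_lines=$(( $(wc -l < both.txt) ))
echo "$line_count lines in out.txt; errors captured: $errors_captured"

cd "$start_dir"
rm -r "$workdir"
```

Order matters in the last command: `> both.txt` has to come before `2>&1`, because `2>&1` copies whatever STDOUT points to at that moment.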

### Walrus Sounds, Revisited

Find the distinct walruses in a single line:

`cut -f 1 walrus_sounds.tsv | sort | uniq`

Distinct vocalizations

  • `cut -f 2 walrus_sounds.tsv | sort | uniq`

  • We can use an option in `uniq` to count how many times each vocalization occurs: `cut -f 2 walrus_sounds.tsv | sort | uniq -c`
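A common extension of the counting pipeline is to sort the counts so the most frequent item comes first. The sketch below runs on made-up two-column data held in a variable, not on the class file.

```bash
# Made-up tab-separated data: walrus name, then vocalization.
tsv=$(printf 'Wally\tgrunt\nTusk\tbellow\nWally\tgrunt\nTusk\tgrunt\nWally\twhistle\n')

# Column 2 -> sort -> count duplicates -> sort by count, largest first.
counts=$(echo "$tsv" | cut -f 2 | sort | uniq -c | sort -nr)
echo "$counts"

# The first line is the most frequent vocalization; tr -s squeezes the padding.
top=$(echo "$counts" | head -n 1 | tr -s ' ')
echo "most frequent:$top"
```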

### grep

`grep` takes its name from the `ed` editor command `g/re/p`: globally search for a regular expression and print the matching lines.

  • `grep not haiku.txt`

  • `grep 'is' haiku.txt` I prefer to almost always use quotes

  • `grep -w 'is' haiku.txt` For matching whole words

  • Where did that option come from? `grep --help` (or googling)

  • `grep -i 'is not' haiku.txt` For case insensitivity

  • `grep -c 'is' haiku.txt` to count lines

  • `grep -v 'is' haiku.txt` outputs all lines that don't match, surprisingly useful

  • `grep '^The' haiku.txt` outputs lines that start with "The"
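The options above can be compared side by side on a four-line file. The text below is made up for illustration and is not the class `haiku.txt`.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# A made-up stand-in for the class text file.
printf 'The walrus is large.\nThis one sleeps.\nIt IS NOT small.\nthe end\n' > sample.txt

any=$(grep -c 'is' sample.txt)           # "is" anywhere, including inside "This"
whole=$(grep -cw 'is' sample.txt)        # "is" as a whole word only
nocase=$(grep -ci 'is not' sample.txt)   # ignore case, so "IS NOT" matches
echo "$any $whole $nocase"               # 2 1 1

grep -v 'is' sample.txt                  # lines that do NOT contain "is"
grep '^The' sample.txt                   # lines that start with "The"

cd "$start_dir"
rm -r "$workdir"
```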

### FASTA file (skip in favor of grep, if time is short)

Maybe do this after grep, since I use grep.

  • What's a FASTA file?

  • How big is the file? `ls -lh` gives us filesize. `wc` gives us line, word, and byte counts. `wc -l` gives us number of lines. But that isn't the number of sequences.

  • `grep -c '^>'` gives us number of headers

  • brief description of regular expressions (the concept)

  • Count different HGNC orthologs: `grep '^>' emergency_files/N_vespilloides_trans.fa | cut -f 2 | sort | uniq -c | sort -nr`

  • Mention tools like `seqtk`?
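The difference between counting lines and counting sequences shows up even on a tiny file. The FASTA below is made up, and it assumes, as the `cut -f 2` step above does, that each header carries a tab-separated second field.

```bash
start_dir=$(pwd)
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up FASTA file: each record is a ">" header line
# followed by one or more sequence lines.
printf '>seq1\tGENE_A\nACGTACGT\nACGT\n>seq2\tGENE_B\nTTGACA\n>seq3\tGENE_A\nGGCC\n' > tiny.fa

total_lines=$(( $(wc -l < tiny.fa) ))    # every line, headers and sequence alike
n_seqs=$(grep -c '^>' tiny.fa)           # only lines that start with ">"
echo "$total_lines lines, $n_seqs sequences"

# Headers only -> second tab-separated field -> count each distinct value.
grep '^>' tiny.fa | cut -f 2 | sort | uniq -c | sort -nr
top_gene=$(grep '^>' tiny.fa | cut -f 2 | sort | uniq -c | sort -nr | head -n 1 | tr -s ' ')

cd "$start_dir"
rm -r "$workdir"
```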

History

  • See old commands with `history`

  • Execute previous command with `!<number>`

  • `history | tail`

  • `history | tail -n 50 > recent_commands.txt`

  • `history | grep something`

Random topics

  • **Compression** tar, gz; `zless` (and `less` itself on systems with an input preprocessor configured) can read a gz file directly

  • `tar -zxvf`, `gunzip`, `unzip`

  • PATH, environment variables

  • `echo $PATH | tr : '\n'`

  • Package managers (`apt-get` in Ubuntu, etc)

  • Run command in the background.

  • `top`, PID, `kill -9`

  • `nohup`, `screen`
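For the background-job topics, here is a sketch of starting, checking, and stopping a process. `sleep 60` stands in for a long-running analysis, and the variable names are made up.

```bash
# & starts the command in the background and returns the prompt immediately.
sleep 60 &
job_pid=$!                     # $! holds the PID of the last background job
echo "started PID $job_pid"

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$job_pid" 2>/dev/null; then before=running; else before=gone; fi

kill "$job_pid"                # polite request (SIGTERM); kill -9 is the last resort
wait "$job_pid" 2>/dev/null    # collect the job so the shell forgets it

if kill -0 "$job_pid" 2>/dev/null; then after=running; else after=gone; fi
echo "before: $before, after: $after"
```

A plain `kill` gives the program a chance to clean up; `kill -9` cannot be caught by the program, so it is best kept for processes that ignore the polite request.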