The cp command only copies files and directories within the local host's file systems. The scp command is similar to cp, but lets you securely copy files from one machine to another. Like cp, scp has a -r (recursive) option for copying directories.
scp usage is similar to cp in that it copies from a <source> to a <destination>, but uses remote machine addressing to qualify either the <source> or the <destination>.
Remote machine addressing looks like this:
<user_account>@<hostname>:<source_or_destination>
Examples:
Open a new Terminal (Mac) or Command Prompt (Windows) window on your local computer (not logged in to your student account), and try the following, using your studentNN account and GSAF pod host gsafcomp02.ccbb.utexas.edu. (This example will not work if you're using the RStudio Terminal because you don't have access to the UT VPN service.)
Note that you will always be prompted for your credentials on the remote host when you execute an scp command.
To copy a remote file:
# On your local computer - not gsafcomp02
# Be sure to use your assigned student account and hostname

# Copy "haiku.txt" from your remote student Home directory to
# the Home directory on your computer
scp student01@gsafcomp02.ccbb.utexas.edu:~/haiku.txt .

# Copy "haiku.txt", now in your local current directory,
# to your remote student Home directory with the name "h2.txt"
scp ./haiku.txt student01@gsafcomp02.ccbb.utexas.edu:~/h2.txt
To copy a remote directory:
# On your local computer - not gsafcomp02
# Be sure to use your assigned student account and hostname

# Copy the "docs" directory and its contents from your remote
# student Home directory to a local sub-directory called "local_docs"
scp -r student01@gsafcomp02.ccbb.utexas.edu:~/docs/ ./local_docs/

# Copy the "local_docs" sub-directory in your local current directory
# to your remote student Home directory with the name "remote_docs"
scp -r ./local_docs/ student01@gsafcomp02.ccbb.utexas.edu:~/remote_docs/
When transferring files between your computer and a remote server, you always need to execute the scp command on your local computer. This is because your personal computer does not have an entry in the global hostname database, whereas the remote computer does. The global Domain Name System (DNS) database maps full host names to their IP (Internet Protocol) addresses. Computers that can be accessed from anywhere on the Internet have their host names registered in DNS.
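You can see this name-to-IP mapping yourself with a DNS lookup tool. A minimal sketch, assuming the nslookup utility is available on your computer (exact output will vary):

# Ask DNS for the IP address registered for the GSAF pod host
nslookup gsafcomp02.ccbb.utexas.edu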
The wget <url> command lets you retrieve the contents of a valid Internet URL (e.g. http, https, ftp).
Example:
# Make a new "wget" directory in your student Home directory and
# change into it
mkdir -p ~/wget; cd ~/wget

# Download a Gencode statistics file using default output file naming
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt

# If you execute the same wget again and the output file already exists,
# wget will create a new one with a numeric extension
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats*

# Download the same Gencode statistics file to a different local filename
wget -O gencode_stats.txt \
  "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l gencode_stats.txt
Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.
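For directories, here is a minimal tar sketch (the "my_dir" directory name is hypothetical):

# Create a gzip-compressed archive of the "my_dir" directory
tar -czf my_dir.tar.gz my_dir
# List the archive's contents without extracting
tar -tzf my_dir.tar.gz
# Extract the archive, re-creating the "my_dir" directory
tar -xzf my_dir.tar.gz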
Let's see how gzip works by using a small FASTQ file that contains NGS read data, where each sequence is represented by 4 lines.
# Copy the small.fq file into a new ~/gzips directory
cd; mkdir -p gzips
cp -p /stor/work/CCBB_Workshops_1/misc_data/fastq/small.fq ~/gzips/
cd ~/gzips
ls -lh          # small.fq is 66K (~66,000) bytes long
wc -l small.fq  # small.fq is 1000 lines long
By default, when you call gzip on a file it compresses it in place, creating a file with the same name plus a .gz extension.
gzip small.fq # Compress the small.fq file in place, producing
# a small.fq.gz file (and removing small.fq)
ls -lh        # small.fq.gz is only 15K bytes -- 4x smaller!
The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.
# Decompress the small.fq.gz file in place, producing a small.fq file
gunzip small.fq.gz
# or
gzip -d small.fq.gz
Both gzip and gunzip also have -c or --stdout options that tell the command to write to standard output, keeping the original files unchanged.
cd ~/gzips # change into your ~/gzips directory
ls small.fq # make sure you have an uncompressed "small.fq" file
gzip -c small.fq > sm2.fq.gz # compress the "small.fq" into a new file
# called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq # decompress "sm2.fq.gz" into a new
# "sm3.fq" file
ls -lh
Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.
cd ~/gzips   # change into your ~/gzips directory
ls small.fq  # make sure you have an uncompressed "small.fq" file
cat small.fq | gzip > sm4.fq.gz
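The reverse also works: gunzip reads compressed data on standard input and writes the decompressed text to standard output. A quick check using the "sm4.fq.gz" file just created:

cat sm4.fq.gz | gunzip | wc -l   # should report 1000, like the original small.fq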
The good news is that most bioinformatics programs can accept data in compressed gzipped format. But how do you view these compressed files?
Here are some ways to work with a compressed file:
cd ~/gzips
cat ../jabberwocky.txt | gzip > jabber.gz # make a compressed copy of
# "jabberwocky.txt"
less jabber.gz # use 'less' to view compressed
# "jabber.gz" (type 'q' to exit)
zcat jabber.gz | wc -l # count lines in the compressed
# "jabber.gz" file
zcat jabber.gz | tail -4 # view the last 4 lines of the
# "jabber.gz" file
zcat jabber.gz | cat -n # view "jabber.gz" text with
# line numbers (zcat does not
# have an -n option)
Display lines 7 - 9 of the compressed "jabber.gz" text:

zcat jabber.gz | cat -n | tail -n +7 | head -3
The find command is a powerful – and of course complex! – way of looking for files in a nested directory hierarchy. The general form I use is:
find <in_directory> [ operators ] -name <expression> [ tests ]
Examples:
cd
find . -name "*.txt" -type f # find all .txt files in the Home directory
find . -name "*docs*" -type d # find all directories with "docs"
# in the directory name |
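The [ tests ] can go beyond name matching. A hedged sketch of two common tests (results depend on your files):

find . -type f -size +1M    # find files larger than 1 megabyte
find . -type f -mtime -7    # find files modified within the last 7 days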
The /stor/work/CBRS_unix/fastq/ directory contains sequencing data from a GSAF Job. Its structure, as shown by tree, is:
[tree output showing sample sub-directories containing .fastq.gz files]
Use find to find all fastq.gz files in /stor/work/CBRS_unix/fastq/.
find /stor/work/CBRS_unix/fastq/ -name "*.fastq.gz" -type f |
How many fastq.gz files in /stor/work/CBRS_unix/fastq/ were run in sequencer lane L001?
find /stor/work/CBRS_unix/fastq/ -name "*L001*fastq.gz" -type f | wc -l |
How many sample directories in /stor/work/CBRS_unix/fastq/ were run in 2020?
find /stor/work/CBRS_unix/fastq/ -name "*2020*" -type d | wc -l |
When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links (symlinks) to those files in a directory where you plan to work with them. You can use them in your analysis as if they were local to your working directory, without the storage cost of copying them.
Storage is a limited resource, so never copy large data files! Create symbolic links to them in your analysis directory instead.
The ln -s <path_to_link_to> [ link_file_name ] command creates a symbolic link to <path_to_link_to>.
Examples:
# Create a symlink to the ~/haiku.txt file using relative path syntax
mkdir -p ~/syms; cd ~/syms
ln -s -f ../haiku.txt
ls -l
The ls -l long listing in the ~/syms directory displays the symlink like this:
[ls -l output showing the haiku.txt symlink]

The permissions string (lrwxrwxrwx) has an l in the left-most file type position, indicating this is a symbolic link.

Now create a symlink to a non-existent file:
# Create a symlink to a non-existent "../xxx.txt" file,
# naming the symlink "bad_link.txt"
cd; mkdir -p ~/syms; cd ~/syms
ln -sf ../xxx.txt bad_link.txt
ls -l
Now both the symlink and the linked-to file are displayed in red, indicating a broken link.
[ls -l output showing bad_link.txt and its missing target displayed in red]
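As an aside, find can locate broken symlinks. A sketch using GNU find's -L option, which follows symbolic links (with -L, only broken links still test as type l):

find -L ~/syms -type l   # list only the broken symlinks under ~/syms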
Multiple files can be linked by providing multiple file name arguments along with the -t (target) option, which specifies the directory where all the links will be created.
# Create multiple symlinks to the *.bed files in the
# ~/data/bedfiles/ directory
# -t . says create all the symlinks in the current directory
cd; mkdir -p ~/syms; cd ~/syms
ln -sf -t . ../data/bedfiles/*.bed
ls -l
What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in sub-directories:
[tree output showing FASTQ files nested in sample sub-directories]
Here's a solution using find and xargs:
mkdir -p ~/syms/fa; cd ~/syms/fa
find /stor/work/CBRS_unix/fastq -name "*.gz" | xargs ln -sf -t .
Create symbolic links to the directories in /stor/work/CBRS_unix/fastq/ that were sequenced in 2020.

Step by step:
find /stor/work/CBRS_unix/fastq/ -name "*2020*" -type d |
ln wants its pathnames in its argument list, not on standard input. xargs <cmd> takes data on its standard input and puts it on the argument list of <cmd>.
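To see what xargs does, here is a quick sketch using ls, which, like ln, expects pathnames as arguments rather than on standard input:

# xargs passes the pathnames from find's output to ls as arguments
find ~/gzips -name "*.fq*" | xargs ls -l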
man ln (3rd form) says:

ln [OPTION]... TARGET... DIRECTORY

This tells us that when there are multiple pathname arguments, links to each will be created in the directory named by the last pathname. But we want to create our symbolic links in the ~/syms/fa directory, so we can't just pipe output from find to xargs ln -sf: links to the directories returned by find would be created inside the last directory in the list.
man ln (4th form) says:

ln [OPTION]... -t DIRECTORY TARGET...

This tells us that we can specify the directory where links are created using the -t option. Then links to all the pathnames on the argument list will be created in that directory.
mkdir -p ~/syms/fa; cd ~/syms/fa
find /stor/work/CBRS_unix/fastq/ -name "*2020*" -type d | xargs ln -sf -t .
Note that other Unix commands, such as mv and cp, also allow a directory to be specified as an option rather than an argument.
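For example, with GNU cp (a minimal sketch; the "~/backup" directory name is hypothetical):

mkdir -p ~/backup
# -t names the target directory; all remaining arguments are sources
cp -t ~/backup ~/haiku.txt ~/jabberwocky.txt
ls -l ~/backup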