Scott's list of linux one-liners

    Mar 02, 2013

    This is a selection of linux one-liners useful in NGS, roughly categorized. Almost none of these will run "as-is" - some are cut straight from shell scripts and so refer to variables, others have file names which you'd need to customize.

    Note also that some of the grep strings are hardwired to the UT Illumina sequencer - the introductory "@HWI" of a fastq file is instrument-specific. You can't just use '^@' by itself to grep a fastq because some fraction of the ASCII-encoded quality values start with "@".
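
    If you want to sidestep the instrument-specific grep entirely, awk's record arithmetic pulls out just the sequence lines of any fastq (a minimal sketch; it assumes strict 4-line records with no line wrapping):

    awk 'NR % 4 == 2' seqs.fastq | head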

    These are intended to be useful and general-purpose; you could do almost any of these examples in a more optimized way, in 10 different ways, with other tools - Python, Perl, and so on. But that's not the point.

    The point for me is to have a set of core commands I know well enough that they will give me the answer I want, correctly, and quickly to a very wide variety of questions.

    The core commands - all of these have lots of options, making them powerful and general.

    • grep uses regular expressions to search for strings in files. Some good online resources to understand it are at robelle.com and at thegeekstuff.com (http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples-–-part-ii/)
    • awk is an actual programming language, like perl or python. But it's built into linux, is very general-purpose, and is ideal for parsing & filtering large text files.
    • uniq simply recognizes when two or more adjacent lines are the same. Some useful switches: -c counts how many identical lines there were and prepends the count to the line itself, -w N only uses the first N characters of each line to decide if they're the same.
    • sort orders a bunch of lines. It's very fast and memory efficient - sorting a few million lines on a current linux system should only take a few minutes.

    There are others I use often, but not as much as these four. Examples: sed, paste, join, cut, find.
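
    As a quick illustration of how the four compose (a sketch - data.txt is a hypothetical whitespace-delimited file), this ranks the most frequent values in column 2, using the same sort | uniq -c | sort -n -r idiom that recurs throughout this page:

    awk '{print $2}' data.txt | grep -v '^$' | sort | uniq -c | sort -n -r | head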

    Fasta/fastq file characterization, reformatting, etc.

    Find out if you have some unusually abundant sequences:

    head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | sort | uniq -c | sort -n -r | head
    

    This will produce a ranked list of the most abundant sequences from among the first 25,000 sequences (remember 4 lines per sequence in a fastq).

    It's easy to fool this approach, especially if your sequences are long and have a higher error rate at the end; in that case you can tell uniq to compare only, say, the first 30 bp:

    head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | sort | uniq -c -w 30 | sort -n -r | head
    

    If you are expecting a constant motif somewhere in all your data, it's a good idea to see if it's there. Just add an awk to snip out where the constant region should be (10 bp starting at position 12 in this example, i.e. bp 12 through 21):

    head -100000 seqs.fastq | grep -A 1 '^@HWI' | grep -v '^@HWI' | awk '{print substr($1,12,10)}' | sort | uniq -c | sort -n -r | head
    

    Take two fastq files and turn them into a paired fasta file - drop the quality info and put each read pair one after the other:

    paste $read1s1 $read2s1  | awk 'BEGIN {c=0} {c++; if (tag!="") {print tag "\n" $1 "\n" tag "\n" $2; tag=""} if (substr($1,1,4)=="@HWI") {tag=$1}}' \
    | sed s/'^@HWI'/'>HWI'/  > $output.paired_s1.fasta 2> $output.log &
    

    If you'd like to keep only read pairs in which every quality value is at least 10, it's a bit more work:

    paste $read1s1 $read2s1  | \
    awk 'BEGIN {c=0} {c++; if (c==4) {print $1 "\t" $2 "\t" seq1 "\t" seq2 "\t" tag; c=0} if (c==2) {seq1=$1; seq2=$2;} if (c==1) {tag=$1}}' \
    | awk 'BEGIN \
      { convert="!\"#$%&\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40"; \
        convert=convert "ABCDEFGHIJKLMNOPQRSTUVWXYZ" } \
      { badQual=0; \
         for (i=1;i<=length($1);i++) {if (index(convert, substr($1,i,1))-1<10) {badQual=1} }; \
         for (i=1;i<=length($2);i++) {if (index(convert, substr($2,i,1))-1<10) {badQual=1} }; \
    if (badQual!=1) { print $5 "\n" $3 "\n" $5 "\n" $4 }  }' \
    | sed s/'^@HWI'/'>HWI'/  > $output.paired_s1.fasta 2> $output.log &
    

    Note that you could have combined the two awk commands - I don't like to do that because it's harder for me to see what I'm doing, and harder to copy for the next problem I need to solve - like that handy awk snippet for converting ASCII quality values to numbers.
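
    Here's that conversion idea as a standalone sketch (assuming Phred+33 encoding, and reusing the "convert" lookup-string trick from above). It prints the numeric quality values for each read:

    awk 'BEGIN { convert="!\"#$%&\x27()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ" } \
      NR % 4 == 0 { out=""; for (i=1;i<=length($0);i++) {out=out " " (index(convert,substr($0,i,1))-1)}; print out }' seqs.fastq | head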

    Sam file utilities (samtools v0.1.19)

    If you have, or need, BAM files, use samtools to interconvert:
    BAM->SAM:

    samtools view <infile.bam> > outfile.sam
    

    SAM->BAM:

    samtools view -S -b <infile.sam> > outfile.bam
    

    If you need to globally change the names of all your references in a sam file, you can use something like this:

    cat notrrna.genome.sam | awk 'BEGIN {FS="\t"; OFS="\t"} \
       {if ($3=="mitochondria") {$3="ChrM"} if ($3=="chloroplast") {$3="ChrC"} \
       if ($3=="2") {$3="Chr2"} if ($3=="1") {$3="Chr1"}  if ($3=="3") {$3="Chr3"}  \
       if ($3=="4") {$3="Chr4"}  if ($3=="5") {$3="Chr5"}  \
       print}' > notrrna.genome.tair.sam
    

    Of course you can also use samtools reheader to accomplish the same thing. It works more reliably on large .bam files with many contigs.
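
    For example (a sketch - file names are hypothetical, and reheader writes the new BAM to stdout):

    samtools view -H notrrna.genome.bam | sed s/'mitochondria'/'ChrM'/g > new_header.sam
    samtools reheader new_header.sam notrrna.genome.bam > renamed.bam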

    If you only need to change the name of one reference, I'd use sed and a trick like this to pipe straight from binary to text and back to binary:

    samtools view -h notrrna.genome.bam | sed s/'mitochondria'/'ChrM'/g | samtools view -S -b - > blah.bam
    

    You can put any combination of grep/sed/awk you like in between the two samtools commands; this might be useful if you have a mixed population. You should map all the data at once to a concatenated reference, but then might want to separate the BAM file into two - one for organism A and one for organism B.
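
    For instance, if all the organism A contigs share a name prefix (assumed to be "orgA_" here - substitute whatever your reference actually uses), something like this pulls out just those alignments while keeping the header:

    samtools view -h mixed.bam | awk 'BEGIN {FS="\t"} /^@/ || $3 ~ /^orgA_/' | samtools view -S -b - > orgA.bam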

    You can also use this for things like searching for indels quickly by scanning the CIGAR string, or parsing any of the custom fields that your mapper/genotyper have produced.
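
    For example, to scan for reads whose CIGAR string (field 6 of a SAM record) contains an insertion or deletion:

    samtools view input.bam | awk 'BEGIN {FS="\t"} $6 ~ /[ID]/ {print $3 "\t" $4 "\t" $6}' | head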

    Convert BAM file back to fastq
    samtools view input.bam | \
       awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > output.bam.fastq
    

    You can also do this with bamtools and picard. Note that reads mapped to the reverse strand will come out reverse-complemented relative to the original fastq, since SAM stores the sequence in aligned orientation; picard's SamToFastq handles that for you.

    This seqanswers post has a great two-liner to add a read group to a BAM file, and contrasts it with a perl program (many more than 2 lines) to do the same thing.
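
    The idea looks roughly like this (a sketch, not the exact commands from that post - the read group ID, sample, and platform are hypothetical placeholders, and it assumes the header starts with an @HD line, after which the new @RG line is inserted):

    samtools view -h in.bam | awk 'BEGIN {FS=OFS="\t"} \
       /^@HD/ {print; print "@RG","ID:grp1","SM:sample1","PL:ILLUMINA"; next} \
       /^@/ {print; next} \
       {print $0, "RG:Z:grp1"}' | samtools view -S -b - > withRG.bam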

    If you'd like to find out the most abundant sequence start site, try this:

    cat in.sam | head -100000 | awk '{print $3"_"$4}' | sort | uniq -c | sort -n -r | head
    

    Note that this samples just the first 100,000 lines of the file - that's fine on an unsorted sam file, but doesn't make sense on a position-sorted bam/sam, where you'd only see the first reads by coordinate. On a sorted file, you have to process the whole file. Even so, running this command on a few tens of millions of sequences should take only a few minutes.

    If you'd like to join mated sequences in a sam/bam file onto one line (and discard the unmated sequences) so you can process them jointly, this method is pretty robust:

    cat in.sam | awk '{if (and(0x0040,$2)&&!(and(0x0008,$2))&&!(and(0x0004,$2))) {print $1 "\t" $2 "\t" $3"\t"$4}}' > r1
    cat in.sam | awk '{if (and(0x0080,$2)&&!(and(0x0008,$2))&&!(and(0x0004,$2))) {print $1 "\t" $2 "\t" $3"\t"$4}}' > r2
    paste r1 r2 > paired.out
    

    It uses the SAM file's flag field (field 2) and some bitwise operations (and() is a gawk built-in) to confirm that both the read and its mate are mapped, puts the separate alignments into separate files, then uses paste to join them. Since mappers like BWA may output a variable number of fields for each read, I've just used the first four fields. You can also do this with a one-line awk command and do more processing within that command if you want to (left as an exercise to the reader).
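
    For reference, the flag bits being tested (from the SAM spec):

    # 0x0004  read unmapped
    # 0x0008  mate unmapped
    # 0x0040  first read of the pair
    # 0x0080  second read of the pair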

    Exploring vcf files

    It's very easy to get some quick stats on VCF files with linux utilities.

    Count all the variants called in all the vcf files
    cat *.vcf | grep -v '^#' | wc -l
    
    Count all the variants in at least two vcf files
    cat *.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -d | wc -l
    
    Count the variants present in all three vcf files (assuming you have exactly three)
    cat *.raw.vcf | grep -v '^#' | awk '{print $1 "\t" $2 "\t" $5}' | sort | uniq -c | grep ' 3 ' | wc -l
    
    How many indels were called in a vcf?
    grep -c INDEL *.vcf
    
    Histogram of allele frequencies
    cat *.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
    awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
    

    Benni's Addenda

    FASTA/FASTQ Wrangling

    Take Illumina CASAVA >= 1.8 FASTQ files and add /1 or /2 to the end of the first field of the read name, throwing out the second field of the name (CASAVA >= 1.8 puts the read-pair information in a second, space-separated field).

    Add paired-end tags to CASAVA 1.8 FASTQ files
    gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
    gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq
    

    If a FASTQ file has paired-end reads interleaved and you want to separate them into /1 and /2 files, this one-liner does it. It assumes each /1 read immediately precedes its /2 mate.

    Untangle shuffled FASTQ file
    seqtk seq -l0 Shuffled.fq | gawk '{if ((NR-1) % 8 < 4) print >> "Separate_1.fq"; else print >> "Separate_2.fq"}'
    

    After some operations on separate R1 and R2 FASTQ files (independent filtering, for example), the two files can fall out of sync, and you need to eliminate the records whose mate was removed.

    Convert FASTQ to one-line-per-record tabular file
    gawk '{printf("%s%s", $0, ((NR % 4 == 0) ? "\n" : "\t"))}' Sample.fastq > Sample_1-line.tab
    
    Take only the matching paired-end reads (join matches on the first whitespace-delimited field, so the read names must be identical in both files)
    sort Sample_1-line_R1.tab > Sample_1-line_sorted_R1.tab
    sort Sample_1-line_R2.tab > Sample_1-line_sorted_R2.tab
    join Sample_1-line_sorted_R1.tab Sample_1-line_sorted_R2.tab > Sample_1-line_joined.tab
    
    Write joined one-line FASTQ information to two FASTQ files
    gawk '{printf("%s\n%s\n%s\n%s\n", $1, $2, $3, $4) >> "Sample_matched_R1.fq"; printf("%s\n%s\n%s\n%s\n", $1, $5, $6, $7) >> "Sample_matched_R2.fq"}' Sample_1-line_joined.tab
    
