Optional: HTseq

HTseq is another tool for counting reads. bedtools has many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads and does little else. HTseq is very slow, and it takes several commands to do the job that a single bedtools multicov command did. Why learn it, then? Because you may care how reads that overlap more than one feature, or only partly overlap a feature, are counted. Please take a look at this page, and if this more sophisticated counting method looks useful to you, use HTseq. Otherwise, use bedtools. | Code Block |
|---|
grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff
join count1.gff count2.gff | join - count3.gff | join - count4.gff > gene_counts_HTseq.gff
# If you have many samples, use a for loop and join
|
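The comment about handling many samples could be sketched roughly as follows. This is only an illustration, not part of the original exercise: the function names `count_sample` and `join_counts` and the `.count` file suffix are hypothetical, and the sketch assumes that every count file lists the same feature IDs in the same order (which is the case when all samples are counted against the same count_ref.gff).

```shell
# Sketch only: count_sample, join_counts and the .count suffix are
# hypothetical names chosen for this illustration.

# Count one sample with the same htseq-count options used above.
# Assumes <sample>.bam and count_ref.gff are in the current directory.
count_sample() {
    samtools view "$1.bam" \
        | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff \
        > "$1.count"
}

# Join any number of two-column count files on the first column (the ID).
# Assumes all files list the same IDs in the same order.
join_counts() {
    merged=$(mktemp)
    tmp=$(mktemp)
    cat "$1" > "$merged"
    shift
    for f in "$@"; do
        # Append the count column of the next sample to the running table.
        join "$merged" "$f" > "$tmp"
        mv "$tmp" "$merged"
    done
    cat "$merged"
    rm -f "$merged" "$tmp"
}
```

With these defined, something like `for s in SRR034450 SRR034451 SRR034452 SRR034453; do count_sample "$s"; done` followed by `join_counts SRR03445?.count > gene_counts_HTseq.gff` should give the same table as the explicit commands above.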
gene_counts_HTseq.gff has 5 more lines than gene_counts.gff. Check out the last 5 lines. They are basic statistics. | Code Block |
|---|
tail gene_counts_HTseq.gff
wc -l gene_counts_HTseq.gff
head -2910 gene_counts_HTseq.gff | tail
|
The basic statistics (the last 5 lines) are useful to know, but they should be removed before the file is used as an input file for DEGseq | Code Block |
|---|
head -2910 gene_counts_HTseq.gff > gene_counts_HTseq.tab
|
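The number 2910 is specific to this data set (the total line count minus the 5 summary lines). If you would rather not hard-code it, a small helper along these lines could compute it instead. The function name `strip_summary` is hypothetical, and the sketch assumes the file ends with exactly 5 summary lines, as described above.

```shell
# Sketch only: strip_summary is a hypothetical helper name.
# Prints every line of the file except the last 5 (the summary lines).
strip_summary() {
    n_summary=5                      # assumed number of trailing summary lines
    total=$(wc -l < "$1")            # total number of lines in the file
    head -n "$((total - n_summary))" "$1"
}
```

It would be used as `strip_summary gene_counts_HTseq.gff > gene_counts_HTseq.tab`.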
Finally, gene_counts_HTseq.tab is ready to use. htseq-count is strand-specific by default. Therefore, the read count for each gene in gene_counts_HTseq.gff is approximately half of the count for the corresponding gene in gene_counts.gff.

Analyze differential gene expression

DESeq

Our data is cluttered with a lot of extra columns, and one column is stuffed with tag=value information (including the gene names that we want!). Let's clean it up a bit before loading it into R, which likes to work on simple tables. GFF files are tab-delimited. We can do this cleanup in many ways, but a quick one is to use the Unix stream editor sed. This command replaces the entire beginning of each line, up to and including locus_tag=, with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag value and the columns of read counts for each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that checks that each line has a locus_tag (they do), among other things. | Code Block |
|---|
| title | Reformatting gene_counts.gff |
|---|
|
head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab
|
After it has run, take a peek at the new file: | Code Block |
|---|
head gene_counts.tab
|
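The locus_tag check mentioned above does not strictly need Perl or Python. As a rough shell alternative (a swapped-in approach, not the script the text has in mind), the reformatting could refuse to run when any line lacks a locus_tag. The function name `reformat_counts` is hypothetical.

```shell
# Sketch only: reformat_counts is a hypothetical helper name.
# Runs the same sed command as above, but only if every line of the
# input file contains "locus_tag=".
reformat_counts() {
    # grep -vc counts the lines that do NOT contain locus_tag=.
    # "|| true" keeps going when grep exits non-zero because that count is 0.
    missing=$(grep -vc 'locus_tag=' "$1" || true)
    if [ "$missing" -ne 0 ]; then
        echo "error: $missing line(s) in $1 have no locus_tag=" >&2
        return 1
    fi
    sed 's/^.*locus_tag=//' "$1"
}
```

It would be used as `reformat_counts gene_counts.gff > gene_counts.tab`.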
| Warning |
|---|
| title | Be very careful how you copy and paste from the example below. |
|---|
| |
Do not copy the > characters. Some commands are spread across multiple lines. In these cases, the lines after the first one do not start with >. So this: | Code Block |
|---|
> y <- c(
1:10
)
> y
|
Is the same as: | Code Block |
|---|
> y <- c(1:10)
> y
|
It's ok to copy across the multiple lines and paste into R as long as you get all the way to the closing parenthesis. |