Tricks to preprocess SOLiD and 454 data

Some tricks to preprocess/assess ABI SOLiD data

  • Look for dominant sequences in your data
    • grep -v '^[>#]' F3.csfasta |sort|uniq -c -w 25|sort -n -r|head -20
    • F3.csfasta : Input file, the raw csfasta file from ABI SOLiD
    • This command lists the 20 most frequent sequences. grep drops the '>' header lines and any '#' comment lines, and uniq -w 25 treats reads as identical when their first 25 characters match. Change 25 if you want more or less of the read to be considered when looking for dominant sequences.
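
The same idea can be wrapped in a small shell function so the prefix length and the number of rows are parameters. This is only a sketch: the name top_prefixes is hypothetical, and it uses cut instead of uniq -w, so it prints just the shared prefix rather than the first full read of each group.

```shell
# Hypothetical helper (not part of any SOLiD toolkit): count the most
# frequent read prefixes in a csfasta file.
#   $1 = csfasta file
#   $2 = prefix length in characters (default 25)
#   $3 = number of rows to show (default 20)
top_prefixes() {
    file=$1
    width=${2:-25}
    rows=${3:-20}
    # Drop header ('>') and comment ('#') lines, keep the first $width
    # characters of each read, then count identical prefixes.
    grep -v '^[>#]' "$file" \
        | cut -c "1-$width" \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "$rows"
}
```

For example, top_prefixes F3.csfasta 25 20 reports the same counts as the command above, and top_prefixes F3.csfasta 15 5 shows the five most common 15-character prefixes.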

Some tricks to preprocess/assess 454 data

  • Convert 454 data to one sequence per line (the grep commands below assume this format)
    • makeSeqsOneLine 454.fna > 454.modified.fna
    • 454.fna : Input file of raw 454 data
    • 454.modified.fna : Output file of modified 454 data
  • Pull out read sequences (with read id) containing a certain pattern (Let's say 'TAGGAC')
    • grep -B 1 'TAGGAC'  454.modified.fna |grep -v '^--$' > 454.pattern.fna
    • 454.modified.fna : Modified 454 data
    • 454.pattern.fna : Fasta file with only reads containing the specified pattern.
  • Pull out read sequences (with read id) starting with a certain pattern (Let's say 'TAGGAC')
    • grep -B 1 '^TAGGAC'  454.modified.fna |grep -v '^--$' > 454.pattern.fna
    • 454.modified.fna : Modified 454 data
    • 454.pattern.fna : Fasta file with only reads starting with the specified pattern.
  • To get the reverse complement sequences for a fasta file, run the following command on fourierseq:
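    • reversecomplement.pl test.fasta|sed -e '/^>/b' -e 's/U/T/g' > test.revcomp.fasta
    • The sed step converts U to T in sequence lines only; header lines (starting with '>') are skipped, since read ids may contain the letter U.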
    • test.fasta: Fasta input file
    • test.revcomp.fasta : Fasta output file, with reverse complemented sequences
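
If makeSeqsOneLine or reversecomplement.pl is not available on your machine, the 454 steps above can be approximated with plain awk. The functions below are a minimal sketch, not the tools named above: all three function names are hypothetical, the reverse complement handles only A, C, G and T (any other character is copied unchanged), and fasta_match and fasta_revcomp assume one sequence per line. Unlike the grep -B 1 approach, fasta_match tests only sequence lines, so a pattern that happens to occur in a read id is not reported.

```shell
# All function names here are hypothetical, chosen for this sketch.

# Join wrapped sequence lines: each record becomes one header line
# followed by exactly one sequence line.
fasta_one_line() {
    awk '
        /^>/ { if (NR > 1) print seq; seq = ""; print; next }
             { seq = seq $0 }
        END  { if (NR > 0) print seq }
    ' "$1"
}

# Print records (header + sequence) whose sequence matches a regex.
#   $1 = one-line fasta file
#   $2 = regex, e.g. 'TAGGAC' (contains) or '^TAGGAC' (starts with)
fasta_match() {
    awk -v pat="$2" '
        /^>/     { header = $0; next }
        $0 ~ pat { print header; print }
    ' "$1"
}

# Reverse-complement every sequence line; headers pass through unchanged.
fasta_revcomp() {
    awk '
        BEGIN {
            from = "ACGTacgt"; to = "TGCAtgca"
            for (i = 1; i <= length(from); i++)
                comp[substr(from, i, 1)] = substr(to, i, 1)
        }
        /^>/ { print; next }
        {
            out = ""
            for (i = length($0); i >= 1; i--) {
                c = substr($0, i, 1)
                out = out ((c in comp) ? comp[c] : c)
            }
            print out
        }
    ' "$1"
}
```

A possible session: fasta_one_line 454.fna > 454.modified.fna, then fasta_match 454.modified.fna '^TAGGAC' > 454.pattern.fna, then fasta_revcomp 454.pattern.fna > 454.pattern.revcomp.fna.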