    Removing duplicates from alignment output

    Jul 25, 2012

    If reads in your BAM file pile up with the same start and end coordinates, they may be PCR duplicates, which should be removed or flagged in your BAM files.
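As a rough first screen before opening a genome browser, pile-ups can also be counted from the SAM text. The sketch below is only an illustration: `count_start_pileups` is a hypothetical helper name, the threshold is arbitrary, and it looks only at the leftmost mapped position and strand of each read (not the end coordinate or the mate), so it will over-report in very deeply covered regions.

```shell
# Hypothetical helper (not part of samtools): reads SAM records on stdin and
# prints "count<TAB>chrom<TAB>start<TAB>strand" for every leftmost start
# position shared by at least min_reads mapped reads, largest counts first.
count_start_pileups() {
    min_reads="${1:-10}"   # arbitrary default threshold, adjust to your depth
    awk -F '\t' -v min="$min_reads" '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }      # skip unmapped reads (FLAG 0x4)
        {
            # FLAG 0x10 set means the read maps to the reverse strand
            strand = (int($2 / 16) % 2 == 1) ? "-" : "+"
            n[$3 "\t" $4 "\t" strand]++
        }
        END {
            for (k in n) if (n[k] >= min) print n[k] "\t" k
        }
    ' | sort -k1,1nr
}
```

A possible use, with a placeholder file name: `samtools view input.srt.bam | count_start_pileups 20`.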

    Picard MarkDuplicates is the preferred tool for this, but it is picky about the BAM files it will accept.

    Samtools can be an easier starting point for removing potential PCR duplicates from your data.

    1. (OPTIONAL) samtools fixmate

    Because samtools rmdup works better when the insert size is set correctly, samtools fixmate can be run first to fill in mate coordinates, ISIZE and mate-related flags from a name-sorted alignment. The fixmate output is still name-sorted, so sort it by coordinate before running rmdup.

    samtools fixmate <in.nameSrt.bam> <out.bam>
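The sorting around fixmate can be sketched as a small wrapper. This is a minimal sketch, not the page's own recipe: `fixmate_and_sort` and the output file names are hypothetical, and the `sort` calls assume the samtools 1.x syntax (`sort -o OUT IN`); samtools 0.1.x releases take an output prefix instead (`samtools sort -n in.bam out_prefix`).

```shell
# Hypothetical wrapper; file names are placeholders chosen for this sketch.
# Assumes samtools 1.x option syntax for "sort".
fixmate_and_sort() {
    in_bam="$1"    # aligned paired-end reads, any sort order
    prefix="$2"    # prefix for the three files written below

    # fixmate needs both mates of a pair next to each other: sort by read name
    samtools sort -n -o "${prefix}.nameSrt.bam" "$in_bam" || return 1

    # fill in mate coordinates, ISIZE and mate-related flags
    samtools fixmate "${prefix}.nameSrt.bam" "${prefix}.fixmate.bam" || return 1

    # rmdup expects coordinate-sorted input: sort again, this time by position
    samtools sort -o "${prefix}.srt.bam" "${prefix}.fixmate.bam"
}
```

Calling `fixmate_and_sort aligned.bam sample` would leave `sample.srt.bam` ready for the rmdup step.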
    

    2. samtools rmdup

    Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only the pair with the highest mapping quality is retained. In paired-end mode, this command ONLY works with FR orientation and requires that ISIZE be correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).

    samtools rmdup [-sS] <input.srt.bam> <out.bam>

    OPTIONS:
    -s Remove duplicates for single-end reads. By default, the command works for paired-end reads only.
    -S Treat paired-end reads as single-end reads.

    Paired-end reads (default):
    samtools rmdup <input.bam> <output.bam>
    Single-end reads:
    samtools rmdup -s <input.bam> <output.bam>
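The two invocations above could be wrapped so that the library layout is chosen explicitly. This is a sketch under assumptions: `run_rmdup` and its `paired`/`single` argument are hypothetical names, and the final `samtools index` call is added here because IGV needs an index alongside a BAM file.

```shell
# Hypothetical wrapper around the two rmdup invocations.
run_rmdup() {
    layout="$1"    # "paired" or "single"
    in_bam="$2"    # coordinate-sorted input BAM
    out_bam="$3"   # de-duplicated output BAM

    case "$layout" in
        paired) samtools rmdup "$in_bam" "$out_bam" || return 1 ;;
        single) samtools rmdup -s "$in_bam" "$out_bam" || return 1 ;;
        *) echo "layout must be 'paired' or 'single'" >&2; return 2 ;;
    esac

    # index the result so a genome browser such as IGV can load it
    samtools index "$out_bam"
}
```

For example, `run_rmdup paired input.srt.bam output.bam` runs the default paired-end mode and then indexes `output.bam`.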
    

    Load the output.bam file into IGV and check the regions that previously showed evidence of PCR duplicates.
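Alongside the visual check, a quick count of alignment records before and after gives a sense of how much was removed. The helper below is a hypothetical sketch (`report_removed` is not a samtools command); it relies only on `samtools view -c`, which prints the number of records in a file.

```shell
# Hypothetical helper: compare record counts of the input and output BAM.
report_removed() {
    before=$(samtools view -c "$1") || return 1   # records in the original BAM
    after=$(samtools view -c "$2") || return 1    # records after rmdup
    echo "reads before: $before, after: $after, removed: $((before - after))"
}
```

A possible call, with placeholder file names: `report_removed input.srt.bam output.bam`.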

    University Wiki Service

    Bioinformatics Team (BioITeam) at the University of Texas