Data wrangling best practices

NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB) – and they're just the start.

...

arrange adequate storage space

  • At TACC
    • Obtain an allocation on TACC's shared work file system or corral disk array (initial 5 TB are no-cost)
    • Stage your active projects on corral or $WORK
      • copy data to $SCRATCH for analysis
      • copy important analysis products back to corral or $WORK
    • Periodically back up corral or $WORK directories to ranch tape archive
  • On a UT Biomedical Research Computing Facility (BRCF) "POD"
    • See https://wikis.utexas.edu/display/RCTFusers
      • Home and Work areas on POD servers are automatically backed up weekly
        • and archived to ranch every 4-6 months
    • GSAF customers can obtain a no-cost 2 TB allocation on the GSAF POD
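
The TACC staging steps above (keep projects on corral or $WORK, copy to $SCRATCH for analysis, copy important products back) could be scripted along these lines. This is only a minimal sketch: the function names `stage_in` and `stage_out`, the `my_project` directory, and the `results` subdirectory are hypothetical, and it assumes the `WORK` and `SCRATCH` environment variables are set as they are in a TACC login session.

```python
import os
import shutil
from pathlib import Path


def stage_in(project_dir, scratch_root):
    """Copy a project directory from long-term storage to scratch for analysis."""
    src = Path(project_dir)
    dest = Path(scratch_root) / src.name
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run to refresh files
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


def stage_out(results_dir, project_dir, subdir="results"):
    """Copy important analysis products back to long-term storage."""
    dest = Path(project_dir) / subdir
    shutil.copytree(results_dir, dest, dirs_exist_ok=True)
    return dest


if __name__ == "__main__":
    # Assumption: $WORK and $SCRATCH are defined; "my_project" is a made-up name.
    work = os.environ.get("WORK")
    scratch = os.environ.get("SCRATCH")
    if work and scratch and (Path(work) / "my_project").is_dir():
        staged = stage_in(Path(work) / "my_project", scratch)
        print(f"Staged project to {staged}")
```

After the analysis finishes, a call such as `stage_out(staged / "results", Path(work) / "my_project")` would copy the products back. Archiving to ranch is a separate step and is not covered by this sketch.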

back up analysis artifacts regularly

...

Artifacts from different stages of the analysis will have different archival requirements.

  • Original sequence data (FASTQ files)
    • must be backed up!
  • Alignments
    • usually larger than original FASTQs
    • can be backed up once stable
  • Downstream analysis artifacts
  • Reporting artifacts (plots, plotting code)
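
Since the original FASTQ files must be backed up, it can help to record a checksum for each file so that a backup copy can later be compared against the original. The page does not prescribe a method; the sketch below is one possible approach, and the function names, the `MANIFEST.sha256` file name, and the list of file patterns are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Assumed naming patterns for sequence files; adjust to your own data.
FASTQ_PATTERNS = ("*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz")


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_name="MANIFEST.sha256"):
    """Write one '<digest>  <relative path>' line per FASTQ file under data_dir."""
    data_dir = Path(data_dir)
    files = sorted({p for pat in FASTQ_PATTERNS for p in data_dir.rglob(pat)})
    manifest = data_dir / manifest_name
    with open(manifest, "w") as out:
        for path in files:
            rel = path.relative_to(data_dir).as_posix()
            out.write(f"{file_sha256(path)}  {rel}\n")
    return manifest


def verify_manifest(data_dir, manifest_name="MANIFEST.sha256"):
    """Return the relative paths that are missing or whose digest has changed."""
    data_dir = Path(data_dir)
    problems = []
    with open(data_dir / manifest_name) as lines:
        for line in lines:
            expected, rel = line.rstrip("\n").split("  ", 1)
            target = data_dir / rel
            if not target.is_file() or file_sha256(target) != expected:
                problems.append(rel)
    return problems
```

Running `write_manifest` on the original data directory and `verify_manifest` on the restored copy (with the manifest copied alongside) gives a list of files that need attention; an empty list means every listed file matched.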

...

...