...

Organization and good practices are critical! Your data can get out of hand very quickly!

...

Keep FASTQ files compressed

  • Most sequencing facilities will give you compressed sequencing data files
    • gzip format (.gz extension) for individual files
    • tar or zip format for directories of files
  • Even with compression it's easy to run out of storage space!

You may be tempted to decompress your sequencing files to manipulate them more directly

  • Resist the temptation to gunzip!
  • Nearly all modern bioinformatics tools are able to work on .gz files
  • There are techniques for working with compressed files without ever decompressing them
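
As a rough illustration of the last point, the sketch below counts the reads in a gzip-compressed FASTQ file by streaming it through Python's standard gzip module, so no decompressed copy is ever written to disk. The function name count_fastq_reads and the file name in the usage comment are hypothetical, and the code assumes the common four-lines-per-record FASTQ layout.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file without writing a decompressed copy.

    Assumes the usual 4-lines-per-record FASTQ layout (no wrapped sequences).
    """
    n_lines = 0
    # "rt" opens the compressed stream in text mode; the data is
    # decompressed on the fly as the file is read.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4


# Hypothetical usage (file name is a placeholder):
# print(count_fastq_reads("sample.fastq.gz"))
```

The same pattern (open with gzip.open, iterate line by line) should carry over to other quick checks, such as peeking at the first few records.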

...

Arrange adequate storage space

  • At TACC
    • Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
    • Stage your active projects on corral or $WORK
      • copy data to $SCRATCH for analysis
      • copy important analysis products back to corral or $WORK
    • Periodically back up corral or $WORK directories to ranch tape archive
  • On a UT Biomedical Research Computing Facility (BRCF) "POD"
    • See https://utexas.atlassian.net/wiki/display/RCTFusers/
      • Home and Work areas on POD servers are automatically backed up weekly
        • and are archived to ranch every 4-6 months
    • GSAF customers can obtain a no-cost 2 TB allocation on the shared GSAF POD
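
The TACC staging pattern above (copy data to $SCRATCH for analysis, then copy important products back) could be scripted roughly as follows. This is a minimal sketch, not a TACC-provided tool: the function names and the project paths in the usage comment are hypothetical, and it assumes the WORK and SCRATCH environment variables are set in your session.

```python
import os
import shutil


def stage_to_scratch(project_dir, scratch_root):
    """Copy a project directory into scratch_root for analysis; return the new path."""
    name = os.path.basename(os.path.normpath(project_dir))
    dest = os.path.join(scratch_root, name)
    # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run to refresh files.
    shutil.copytree(project_dir, dest, dirs_exist_ok=True)
    return dest


def save_products(product_paths, archive_dir):
    """Copy selected analysis product files back to longer-term storage."""
    os.makedirs(archive_dir, exist_ok=True)
    for path in product_paths:
        shutil.copy2(path, archive_dir)  # copy2 also preserves timestamps


# Hypothetical usage (all paths are placeholders):
# scratch_copy = stage_to_scratch(
#     os.path.join(os.environ["WORK"], "my_project"), os.environ["SCRATCH"])
# save_products([os.path.join(scratch_copy, "results.bam")],
#               os.path.join(os.environ["WORK"], "my_project", "results"))
```

Copying back only a named list of products, rather than the whole scratch directory, keeps bulky intermediate files out of the longer-term storage area.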

...

Back up analysis artifacts regularly

  • All TACC users automatically have a 2 TB allocation on TACC's ranch tape archive system
    • Larger allocations can be requested by project owners in the TACC User Portal
    • Free!
  • Periodically back up your corral or $WORK directories to ranch tape archive
    • Large directories should be combined first using the tar program
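
The text recommends the tar program for combining a directory before archiving. For readers who script their backups, the sketch below does the equivalent with Python's standard tarfile module. The function name tar_directory and the paths in the usage comment are hypothetical, and the transfer to ranch itself is not shown.

```python
import os
import tarfile


def tar_directory(src_dir, tar_path):
    """Bundle src_dir into a single tar file and return the number of archive members."""
    # Mode "w" writes an uncompressed tar, which is usually enough when the
    # directory already holds compressed files such as .fastq.gz.
    with tarfile.open(tar_path, "w") as tar:
        # arcname stores paths relative to the directory's own name
        # instead of its full path on disk.
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    # Re-open the archive to confirm it is readable and count its entries
    # (the top-level directory itself counts as one member).
    with tarfile.open(tar_path, "r") as tar:
        return len(tar.getmembers())


# Hypothetical usage (paths are placeholders):
# n = tar_directory("my_project/fastq", "my_project_fastq.tar")
```

Archiving one tar file rather than many small files is the point of the bullet above; the member count gives a quick sanity check before the original directory is cleaned up.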

...

Distinguish between types of data

Files from different stages of the analysis will have different archival requirements.

  • Original sequence data (FASTQ files)
    • Must be backed up!
  • Alignments
    • Usually larger than original FASTQs
    • Can be backed up once stable
  • Downstream analysis results
  • Reporting files (plots, plotting code)

While a project is active you may want to keep more intermediate results for reference. Many of these can be removed after publication.

...

Track your analysis steps

Your analyses should be reproducible by others so you need to keep the equivalent of a lab notebook to document your protocols.
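
One lightweight way to keep such a record, sketched here under assumptions rather than as a prescribed method, is to run each analysis command through a small wrapper that appends the exact command line to a plain-text notebook file. The function name run_and_log, the notebook file name, and the example command in the usage comment are all hypothetical.

```python
import datetime
import shlex
import subprocess


def run_and_log(cmd, notebook_path):
    """Run a command (given as a list of arguments) and append a record to a text notebook."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd, check=False)
    with open(notebook_path, "a") as notebook:
        # One tab-separated line per step: start time, exit status, exact command.
        notebook.write(f"{started}\t{result.returncode}\t{shlex.join(cmd)}\n")
    return result.returncode


# Hypothetical usage (command and file names are placeholders):
# run_and_log(["gzip", "-t", "sample.fastq.gz"], "analysis_notebook.tsv")
```

A log like this records what was run and when, but not why; free-text notes on the reasoning behind each step still belong in the notebook alongside it.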

...