split_blast

Data-Parallel BLAST+ On TACC

split_blast takes your FASTA input file and runs your BLAST+ command on split chunks of the input data. The procedure it follows (sketched below) is:

  1. the input data is split into chunks in a manner you specify,
  2. BLAST commands are run in parallel on the separate input data chunks,
  3. the results from the separate BLAST runs are concatenated into a final file.
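
Conceptually, split_blast automates a workflow like the following shell sketch. This is only an illustration of the idea, not how split_blast works internally: the chunk file names, the awk-based splitter, and the backgrounded loop are assumptions made for the sake of the example.

# Illustration only -- split_blast performs the equivalent of these steps for you.
# 1. Split the FASTA input into chunks of roughly 1000 sequences each.
awk '/^>/ {if (n++ % 1000 == 0) f = sprintf("chunk_%03d.fasta", ++c)} {print > f}' input.fasta
# 2. Run the same BLAST+ command on every chunk in parallel.
for chunk in chunk_*.fasta; do
  blastx -query "$chunk" -db /corral-repl/utexas/BioITeam/blastdb/nr -outfmt 6 -out "$chunk.out" &
done
wait
# 3. Concatenate the per-chunk results into a single output file.
cat chunk_*.fasta.out > final_results.txt

split_blast takes care of all of these steps for you, and runs the parallel BLAST jobs through the TACC launcher rather than as background processes on a login node.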

split_blast only works with BLAST+, not the earlier BLAST. That means using commands like blastx or blastp, rather than the earlier blastall.

Running split_blast

To run split_blast, you write out an ordinary BLAST+ command, but you must preface it with two types of information: 1) how you want to split the data, and 2) parameters for TACC (your allocation, and how long you want to let BLAST run).
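
Schematically, every split_blast invocation has the same shape (the angle-bracket items are placeholders you fill in yourself):

split_blast <splitting option> -a <allocation> -t <hh:mm:ss> <ordinary BLAST+ command>

The splitting and TACC options are described next; a complete, concrete example appears further down the page.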

Splitting the data

To split the data, you have three options: specify the number of split files, the number of records (sequences) per split file, or the maximum size of each split file. The options are listed here, with example usage after the list.

-N    number of split files
-R    number of records (sequences) per split file
-M    maximum size of each split file (e.g. 100M for 100 MB)

Warning: specify only one of these options!
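
For instance, the front of a split_blast command might look like one of these (the rest of the command is elided; only the -N and -M forms appear elsewhere on this page, so the -R value shown is an assumed format based on its description above):

split_blast -N 20 ...     # exactly 20 split files
split_blast -R 5000 ...   # about 5000 sequences per split file
split_blast -M 100M ...   # no split file larger than about 100 MB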

Giving TACC what it needs

The launcher that split_blast uses needs to know what resources to request from TACC: the time to allow for the run, and the allocation to charge. The two options are listed here, with a brief example after the list.

-a    allocation to use (the TACC allocation to charge)
-t    time to run (hh:mm:ss, as in the example below)
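
These two options sit between the splitting option and the BLAST+ command, for example (the eight-hour limit is just an illustrative value):

-a <allocation> -t 08:00:00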

An ordinary BLAST+ command

After you've entered the splitting options and the TACC information, just write out a standard BLAST+ command. The output format (-outfmt) must be 5, 6, or 7 so that split_blast knows how to combine the individual split results. Also, don't bother specifying a -num_threads option, as split_blast will override it in any case.

A split_blast example

The following example runs blastx against the nr database, splitting the input FASTA file into 6 chunks before running BLAST on them.

split_blast -N 6 -a <allocation> -t 12:00:00 blastx -outfmt 6 -db /corral-repl/utexas/BioITeam/blastdb/nr -max_target_seqs 1 -query <input>.fasta -out <output>.txt

For larger datasets, you will want to split the input into more chunks; using 20-30 chunks would not be unreasonable. You could also use something like -M 100M and let the program split the data into as many chunks as needed so that no chunk is larger than 100 MB, as in the variant below.
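
For instance, a size-based variant of the example above might look like this (the 24-hour time limit is simply an illustrative choice):

split_blast -M 100M -a <allocation> -t 24:00:00 blastx -outfmt 6 -db /corral-repl/utexas/BioITeam/blastdb/nr -max_target_seqs 1 -query <input>.fasta -out <output>.txt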