split_blast
Data-Parallel BLAST+ On TACC
split_blast takes your FASTA input file and runs your BLAST+ commands on split chunks of the input data. The procedure it follows is:
- the input data is split into chunks in a manner you specify,
- BLAST commands are run in parallel on the separate input data chunks,
- the results from the separate BLAST runs are concatenated into a final file.
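Conceptually, split_blast automates something like the shell sketch below. This is illustrative only, not split_blast's actual implementation: the filenames, chunk count, and database are placeholder assumptions, and on TACC the real tool hands the chunks to a cluster launcher rather than shell background jobs.

# 1) Split the FASTA input into 4 files at record (sequence) boundaries.
awk -v n=4 '/^>/ {f = sprintf("chunk_%d.fasta", ++i % n)} {print > f}' query.fasta

# 2) Run BLAST+ on each chunk in parallel, as background jobs.
for f in chunk_*.fasta; do
    blastx -query "$f" -db nr -outfmt 6 -out "$f.out" &
done
wait   # block until every background BLAST run finishes

# 3) Concatenate the per-chunk results into one final file.
cat chunk_*.fasta.out > final_results.txt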
split_blast only works with BLAST+, not the earlier BLAST. That means using commands like blastx or blastp, rather than the earlier blastall.
Running split_blast
To run split_blast, you write out an ordinary BLAST+ command, but you must preface it with two types of information: 1) how you want to split the data, and 2) parameters for TACC (your allocation, and how long you want to let BLAST run).
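Putting those together, every invocation has the same general shape (a schematic, not literal syntax):

split_blast <splitting option> -a <allocation> -t <time> <ordinary BLAST+ command>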
Splitting the data
To split the data, you have three options: specify the number of split files, the number of records (sequences) per split file, or the maximum size of each split file.
| Option | Description |
| --- | --- |
| -N | number of split files |
| -R | number of records (sequences) per split file |
| -M | maximum size of each split file |

Only specify one of these!
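For example (the option values here are illustrative placeholders, not recommendations):

split_blast -N 20 ...    # split the input into 20 files
split_blast -R 5000 ...  # 5000 sequences per split file
split_blast -M 100M ...  # no split file larger than 100 MB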
Giving TACC what it needs
The launcher that split_blast uses needs to know what resources to request from TACC: how long the job needs to run, and which allocation to charge.
| Option | Description |
| --- | --- |
| -a | allocation to use |
| -t | time to run |
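For instance, to request a 4-hour limit (the time format is hh:mm:ss, as in the full example below; the allocation name is specific to your TACC account):

split_blast -N 6 -a <allocation> -t 04:00:00 ...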
An ordinary BLAST+ command
After you've entered the splitting options and the TACC information, just write out a standard BLAST+ command. The output format (-outfmt) must be either 5 (XML), 6 (tabular), or 7 (tabular with comment lines) so that split_blast will know how to combine the individual split results. Also, don't bother specifying a -num_threads option, as split_blast will override it in any case.
A split_blast example
The following example runs blastx against the nr database, splitting the input FASTA file into 6 chunks before running BLAST on them.
split_blast -N 6 -a <allocation> -t 12:00:00 blastx -outfmt 6 -db /corral-repl/utexas/BioITeam/blastdb/nr -max_target_seqs 1 -query <input>.fasta -out <output>.txt
For larger datasets, you will want to split the input into more chunks; 20-30 chunks wouldn't be unreasonable. You could also use something like -M 100M, and let the program split the data into enough chunks that no chunk is larger than 100 MB.
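For instance, the example above rewritten to split by size rather than by chunk count (all other options unchanged):

split_blast -M 100M -a <allocation> -t 12:00:00 blastx -outfmt 6 -db /corral-repl/utexas/BioITeam/blastdb/nr -max_target_seqs 1 -query <input>.fasta -out <output>.txt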