# ProbeTools
ProbeTools is a collection of general-purpose modules for designing hybridization probe panels targeting diverse and hypervariable viral taxa. The objective of ProbeTools is to generate the smallest possible panel of oligo sequences that maximizes coverage of provided target sequences. It is based on k-mer clustering. In brief, probe-length k-mers are enumerated from the target space, usually spaced one nucleotide apart so that all possible k-mers are enumerated. The k-mers are then clustered based on their nucleotide sequence identity to collapse redundant probes enumerated from conserved genomic loci. Cluster centroids become probe candidates, which are ranked based on the size of the cluster they represent; centroids representing larger clusters are assumed to make better probes by virtue of having similarity to more sequence in the target space. 
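
The enumeration and clustering steps described above can be illustrated with a toy sketch. It is not the ProbeTools implementation: ProbeTools uses VSEARCH for clustering, whereas the hypothetical `greedy_cluster` function below uses a naive ungapped identity comparison. The function names, the 8-base k-mers, and the 85% threshold are assumptions chosen only to keep the example small.

```python
def enumerate_kmers(seq, k, step=1):
    """Return every k-length window of seq, starting every `step` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def identity(a, b):
    """Percent identity between two equal-length k-mers (ungapped comparison)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_cluster(kmers, min_identity):
    """Assign each k-mer to the first centroid it matches, else start a new cluster.

    Returns (centroid, cluster_size) pairs, largest cluster first.
    This is a simplified stand-in for VSEARCH clustering, not its algorithm.
    """
    clusters = {}  # centroid sequence -> number of member k-mers
    for kmer in kmers:
        for centroid in clusters:
            if identity(kmer, centroid) >= min_identity:
                clusters[centroid] += 1
                break
        else:
            clusters[kmer] = 1
    return sorted(clusters.items(), key=lambda item: item[1], reverse=True)


# Toy target space: two identical sequences and one single-base variant.
targets = ["ATGCGTTGACAG", "ATGCGTTGACAG", "ATGCGATGACAG"]
kmers = [kmer for seq in targets for kmer in enumerate_kmers(seq, k=8)]
candidates = greedy_cluster(kmers, min_identity=85)
```

Here the variant k-mers collapse into the clusters seeded by the first sequence, so the candidate list is much shorter than the list of enumerated k-mers.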

ProbeTools can further optimize probe panel designs by using an incremental strategy. In this strategy, probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. This improves coverage of less-common sequences in the target space and reduces the generation of redundant probes.

Additional details and discussion about ProbeTools, along with <i>in silico</i> and <i>in vitro</i> validation results, can be found in:

Kuchinski <i>et al.</i> (2021) ProbeTools: Hybridization probe design for targeted genomic sequencing of diverse and hypervariable viral taxa.

# Setup 
ProbeTools requires VSEARCH and BLASTn. The ProbeTools package can be installed together with these dependencies via Anaconda/Miniconda. It can also be installed separately from its dependencies via the Python Package Index (PyPI).
## Anaconda/Miniconda
1. Create a conda environment for ProbeTools (replace env_name with a name of your choice for the ProbeTools environment):
```
conda create -n env_name -c kevinkuchinski probetools
```
## PyPI 

1. Install Python (version 3.7 or greater) from https://www.python.org/
2. Install the ProbeTools package:
```
pip install probetools
```
3. Install VSEARCH (version 2.15.2 recommended) from https://github.com/torognes/vsearch
4. Install BLAST (version 2.10.0 recommended) from https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download


# Quick-start to probe design
ProbeTools provides the <b>makeprobes</b> module as a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy. Simply indicate a FASTA file containing target sequences (-t), the number of probes to add each batch (-b), and an output path and design name to append to output files (-o):
```
probetools makeprobes -t target_space_FASTA.fa -b 100 -o demo_probes_dir/demo_probes
```
<b>makeprobes</b> will add batches of probes to the panel until one of three end points is reached:
1. The panel achieves a target coverage goal (default: 90% of target sequences have at least 90% of their nucleotide positions covered)
2. The panel reaches a specific size (default: MAX, i.e. the panel continues to grow until one of the other end points is reached)
3. No further probe sequences can be designed

The desired coverage goal and the maximum panel size can be set, along with numerous other parameters (see usage guide below). In general, smaller batch sizes will provide more compact panels but take more rounds of design and, thus, longer to compute.
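
The three end points can be read as a simple stopping check evaluated after each batch. The sketch below is an illustration under assumptions, not ProbeTools code: the function names, the dictionary of per-target depth lists, and the use of "depth of at least 1" to mean "covered" are all hypothetical.

```python
def coverage_goal_met(depths_by_target, min_pct_positions=90.0, min_pct_targets=90.0):
    """True if enough targets have enough of their positions covered by a probe.

    depths_by_target maps a target name to a list of per-position probe depths.
    """
    passing = 0
    for depths in depths_by_target.values():
        pct_covered = 100.0 * sum(d > 0 for d in depths) / len(depths)
        if pct_covered >= min_pct_positions:
            passing += 1
    return 100.0 * passing / len(depths_by_target) >= min_pct_targets


def stop_reason(depths_by_target, panel_size, new_probes_in_last_batch, max_probes=None):
    """Return which end point was reached, or None if design should continue."""
    if coverage_goal_met(depths_by_target):
        return "coverage goal reached"
    if max_probes is not None and panel_size >= max_probes:
        return "maximum panel size reached"
    if new_probes_in_last_batch == 0:
        return "no further probes could be designed"
    return None


# Toy example: two well-covered targets and one uncovered target.
example_depths = {
    "target_a": [1] * 10,
    "target_b": [1] * 9 + [0],
    "target_c": [0] * 10,
}
reason = stop_reason(example_depths, panel_size=100, new_probes_in_last_batch=100, max_probes=100)
```

In this toy case only two of three targets meet the 90% threshold, so the coverage goal is not met and the panel size limit is what ends the design.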

# ProbeTools modules
ProbeTools consists of 6 modules:
1. <b>makeprobes</b> - a user-friendly, general-purpose implementation of the incremental k-mer clustering strategy
2. <b>clusterkmers</b> - single-batch probe generation using the k-mer clustering algorithm
3. <b>capture</b> - <i>in silico</i> assessment of how well provided probe sequences cover provided target sequences
4. <b>getlowcov</b> - uses output of <b>capture</b> to extract low-coverage regions from provided target sequences
5. <b>stats</b> - uses output of <b>capture</b> to calculate coverage statistics overall and for each provided target sequence
6. <b>merge</b> - merges output files generated by <b>capture</b> module

# Usage guide for ProbeTools modules
## makeprobes
A general-purpose implementation of the incremental k-mer clustering strategy. Probes are added to the panel in batches. Between the addition of each batch, ProbeTools determines what regions of the target space have achieved coverage and removes them from the target space before designing the next batch. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense.

<b>Usage example:</b>
```
$ probetools makeprobes -t <target seqs> -b <batch size> -o <output dir>/<output name> [<optional args>]
```
<b>Required arguments:</b>

     -t : path to target sequences in FASTA file
     -b : number of probes in each batch (min=1)
     -o : path to output directory and design name to append to output files
     
<b>Optional arguments:</b>

     -m : max number of probes to add to panel (default=MAX, min=1)
     -c : target for 10th percentile of probe coverage (default=90, min=1, max=100)
     -k : length of probes to generate (default=120, min=32)
     -s : number of bases separating each kmer (default=1, min=1)
     -d : number of degenerate bases to permit in probes (default=0, min=0)
     -i : nucleotide sequence identity (%) threshold used for kmer clustering and probe-target alignments (default=90, min=50, max=100)
     -l : minimum length for probe-target alignments (default=60, min=1)
     -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)
     -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)
     -T : number of threads used by VSEARCH and BLASTn for clustering kmers and aligning probes to targets (default=MAX for VSEARCH, default=1 for BLASTn, min=1)
     
## clusterkmers
Enumerate and cluster kmers from target sequences. Extract cluster centroids as probe candidates ranked by cluster size. Probe sequences are provided in the output_name_probes.fa file with probe sequences ranked in descending order of cluster size. NOTE: for best results, all target sequences should be provided on the same strand/in the same sense.

<b>Usage example:</b>
```
$ probetools clusterkmers -t <target seqs> -o <output dir>/<output name> [<optional args>]
```
<b>Required arguments:</b>

     -t : path to target sequences in FASTA file
     -o : path to output directory and design name to append to output files
 
<b>Optional arguments:</b>

     -k : length of kmers to enumerate (default=120, min=32)
     -s : number of bases separating each kmer (default=1, min=1)
     -d : number of degenerate bases to permit in probes (default=0, min=0)
     -i : nucleotide sequence identity (%) threshold used for kmer clustering (default=90, min=50, max=100)
     -p : path to FASTA file containing previously-generated probe sequences to remove from new probes
     -n : number of probe candidates to return (default=MAX, min=1)
     -T : number of threads used by VSEARCH for clustering kmers (default=MAX, min=1)
 
## capture
Assess probe panel coverage of target sequences. BLASTn is used to align each provided probe sequence against each provided target sequence. BLASTn output is parsed to determine how many probes cover each nucleotide position in target sequences. Results are output to the output_name_capture.pt file (see .pt format specifications below).

<b>Usage example:</b>
```
$ probetools capture -t <target seqs> -p <probe seqs> -o <output dir>/<output name> [<optional args>]
```
<b>Required arguments:</b>

     -t : path to target sequences in FASTA file
     -p : path to probe sequences in FASTA file
     -o : path to output directory and design name to append to output files
 
<b>Optional arguments:</b>

     -i : nucleotide sequence identity (%) threshold used for probe-target alignments (default=90, min=50, max=100)
     -l : minimum length for probe-target alignments (default=60, min=1)
     -T : number of threads used by BLASTn for aligning probes to targets (default=1, min=1)
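
The step from alignments to per-position probe depths can be sketched as follows. This is a minimal illustration, not the module's code: it assumes each hit is a tuple of subject start, subject end, percent identity, and alignment length (as in BLAST tabular output, where coordinates are 1-based and inclusive and minus-strand hits report start greater than end), and it assumes the `-i` and `-l` thresholds are applied as simple filters. The function name is hypothetical.

```python
def depths_from_hits(target_length, hits, min_identity=90.0, min_length=60):
    """Count how many retained probe alignments cover each target position.

    hits: iterable of (start, end, percent_identity, alignment_length) tuples,
    with 1-based inclusive target coordinates.
    """
    depths = [0] * target_length
    for start, end, pident, length in hits:
        if pident < min_identity or length < min_length:
            continue  # alignment fails the identity or length threshold
        low, high = sorted((start, end))  # minus-strand hits have start > end
        for pos in range(low - 1, high):
            depths[pos] += 1
    return depths


# Toy example on a 10-base target with a short minimum length for readability.
toy_hits = [
    (1, 5, 100.0, 5),   # plus-strand hit covering positions 1-5
    (8, 4, 95.0, 5),    # minus-strand hit covering positions 4-8
    (1, 10, 80.0, 10),  # discarded: identity below threshold
]
toy_depths = depths_from_hits(10, toy_hits, min_identity=90.0, min_length=5)
```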

## getlowcov
Extract poorly covered sub-sequences from target sequences based on a specific set of capture results. Low-coverage sub-sequences are written to the output_name_low_cov.fa file.

<b>Usage example:</b>
```
$ probetools getlowcov -i <input file> -o <output dir>/<output name> [<optional args>]
```
<b>Required arguments:</b>

     -i : path to capture results in PT file
     -o : path to output directory and design name to append to output files
 
<b>Optional arguments:</b>

     -k : minimum sub-sequence length extracted, should be same as kmer length used for making probes (default=120, min=32)
     -D : minimum probe depth threshold used to define low coverage sub-sequences (default=0, min=0)
     -L : minimum number of consecutive bases below probe depth threshold to define a low coverage sub-sequence (default=40, min=1)
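
How the `-D` and `-L` thresholds interact can be sketched as a scan for runs of low-depth positions. This is an assumption-laden illustration rather than the module's code: the function name is hypothetical, a position is treated as low coverage when its depth is at or below the threshold (which is what a default of 0 implies), and the extension of extracted sub-sequences to the `-k` minimum length is not modelled.

```python
def low_coverage_runs(depths, max_depth=0, min_run=40):
    """Return (start, end) index pairs (0-based, end exclusive) for each run of
    at least min_run consecutive positions whose depth is <= max_depth."""
    runs = []
    run_start = None
    for i, depth in enumerate(depths):
        if depth <= max_depth:
            if run_start is None:
                run_start = i  # a new low-coverage run begins here
        else:
            if run_start is not None and i - run_start >= min_run:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(depths) - run_start >= min_run:
        runs.append((run_start, len(depths)))
    return runs


toy_depths = [1, 1, 0, 0, 0, 1, 0, 0]
long_runs = low_coverage_runs(toy_depths, max_depth=0, min_run=3)
all_runs = low_coverage_runs(toy_depths, max_depth=0, min_run=2)
```

With a minimum run of 3, only the middle stretch qualifies; lowering it to 2 also captures the short uncovered tail.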

## stats
Calculate and tabulate probe coverage statistics for target sequences. Overall target space statistics are provided in output_name_summary_report.tsv and statistics for each target sequence are provided in output_name_long_report.tsv. Positions with degenerate bases do not count towards probe coverage calculations if they are not covered by probes.

<b>Usage example:</b>
```
$ probetools stats -i <input file> -o <output dir>/<output name>
```
<b>Required arguments:</b>

     -i : path to capture results in PT file
     -o : path to output directory and design name to append to output files
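
One possible reading of the degenerate-base rule is sketched below: a position holding a degenerate base is left out of the calculation when no probe covers it, and counted as covered otherwise. The function name and this interpretation are assumptions, not a description of how the module computes its reports.

```python
def percent_covered(sequence, depths, min_depth=1):
    """Percent of counted positions covered by at least min_depth probes.

    Uncovered positions with a non-ACGT (degenerate) base are excluded
    from the denominator; all other positions are counted.
    """
    standard_bases = set("ACGT")
    covered = 0
    counted = 0
    for base, depth in zip(sequence.upper(), depths):
        if depth >= min_depth:
            covered += 1
            counted += 1
        elif base in standard_bases:
            counted += 1  # uncovered standard base counts against coverage
        # uncovered degenerate bases are skipped entirely
    return 100.0 * covered / counted if counted else 0.0


# Positions: A, C covered; G, T uncovered; first N uncovered (skipped); second N covered.
toy_coverage = percent_covered("ACGTNN", [1, 1, 0, 0, 0, 1])
```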
 
## merge
Combine results from two output files from the <b>capture</b> module. This module conducts an outer merge: if entries with the same header (and matching nucleotide sequences) appear in both files, their probe depth lists are summed together position-by-position. Entries appearing in only one or the other file are copied to the new file unmodified.

<b>Usage example:</b>
```
$ probetools merge -i <input file> -I <input file> -o <merged output file>
```
<b>Required arguments:</b>

     -i : path to capture results in PT file
     -I : path to other capture results in PT file
     -o : path to merged capture results PT file
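
The outer merge can be sketched with plain dictionaries. This is an illustration only: the function name and the in-memory layout (header mapped to a sequence and depth list) are hypothetical, and raising an error when two entries share a header but differ in sequence is an assumption, since the behaviour in that case is not described here.

```python
def merge_entries(first, second):
    """Outer merge of two {header: (sequence, depths)} dictionaries.

    Shared headers have their depth lists summed position-by-position;
    headers found in only one input are copied unchanged.
    """
    merged = dict(first)
    for header, (sequence, depths) in second.items():
        if header not in merged:
            merged[header] = (sequence, depths)
            continue
        other_sequence, other_depths = merged[header]
        if other_sequence != sequence:
            # Assumed handling; the real module may treat this case differently.
            raise ValueError(f"Sequences differ for entry {header}")
        merged[header] = (sequence, [a + b for a, b in zip(other_depths, depths)])
    return merged


capture_a = {"seq1": ("ATGC", [1, 1, 0, 0]), "seq2": ("GGCC", [2, 2, 2, 2])}
capture_b = {"seq1": ("ATGC", [0, 1, 1, 0]), "seq3": ("TTAA", [0, 0, 1, 1])}
merged_capture = merge_entries(capture_a, capture_b)
```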

# .pt Format Specifications
The .pt format is used for output from the capture module and as input for the stats, getlowcov, and merge modules. The .pt format is largely derived from the FASTA format. Each entry spans three lines, and each line starts with its own identifying character:

<b>Entry header (>):</b> A text header to describe the sequence. Do not use spaces in the header.

<b>Entry sequence ($):</b> The nucleotide sequence of the entry.

<b>Entry probe depths (#):</b> A comma-separated list of the number of probes covering each nucleotide position. The order of the list follows the order of the nucleotide sequence, i.e. the 4th number of the list describes the number of probes covering the 4th nucleotide position of the entry's sequence.

<u>Example entry:</u>
```
>Entry_header
$ATGCGTTGACAGTGCACACG
#1,1,1,1,1,2,2,2,2,2,1,1,2,2,2,3,3,3,3,3
```
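
For downstream analysis, the three-line layout is straightforward to read. The parser below is a minimal sketch based only on the format description above; the function name `parse_pt` and the returned dictionary layout are hypothetical and not part of ProbeTools.

```python
def parse_pt(text):
    """Parse .pt-formatted text into {header: (sequence, depths)}."""
    entries = {}
    header = None
    sequence = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif line.startswith("$"):
            sequence = line[1:]
        elif line.startswith("#"):
            depths = [int(value) for value in line[1:].split(",")]
            if header is None or sequence is None:
                raise ValueError("Depth line found before header and sequence")
            if len(depths) != len(sequence):
                raise ValueError(f"Depth list length does not match sequence for {header}")
            entries[header] = (sequence, depths)
            header = sequence = None  # ready for the next entry
    return entries


example_text = """>Entry_header
$ATGCGTTGACAGTGCACACG
#1,1,1,1,1,2,2,2,2,2,1,1,2,2,2,3,3,3,3,3
"""
example_entries = parse_pt(example_text)
example_sequence, example_depths = example_entries["Entry_header"]
```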
