Metadata-Version: 2.1
Name: haystac
Version: 0.4.9
Summary: Species identification pipeline for both single species and metagenomic samples.
Home-page: https://github.com/antonisdim/haystac
Author: Evangelos A. Dimopoulos, Evan K. Irving-Pease
Author-email: antonisdim41@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Description-Content-Type: text/markdown
License-File: LICENSE

[![Documentation Status](https://readthedocs.org/projects/haystac/badge/?version=latest)](https://haystac.readthedocs.io/en/latest/?badge=latest)

# HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data 

## Introduction 

`haystac` is a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a 
particular species of interest in a metagenomic sample, and provides statistical support for the species assignment. 
The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set 
of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence
data.

## Setup

`haystac` can be run on either macOS or Linux based systems.

The recommended way to install `haystac`, and all if its dependencies, is via the [mamba](https://mamba.readthedocs.io/en/latest/installation.html)
package manager (a fast replacement for [conda](https://docs.conda.io/projects/conda/en/latest/index.html)).

You can install `haystac` using `conda`, however, it will take significantly longer to install and analyses will run slower.

### Install mamba

If you do not have either `mamba` or `conda` already installed, please refer to the [install instructions](
https://mamba.readthedocs.io/en/latest/installation.html) for `mambaforge`.

If you have `conda` installed, but not `mamba`, then install `mamba` into the base environment:
```bash
conda install -n base -c conda-forge mamba
```
### Install haystac
Then use `mamba` to install `haystac` into a new environment:
```bash
mamba create -c conda-forge -c bioconda -n haystac haystac
```
And activate the environment:
```bash
conda activate haystac
```
We recommend that you install `haystac` into a new environment to avoid dependency conflicts with other software.

## Quick Start

`haystac` consists of three main modules:
1) `database` for building a database of reference genomes
2) `sample` for pre-processing of samples prior to analysis
3) `analyse` for analysing a sample against a database

### 1. Build a database

To begin using `haystac` we firstly need to construct a database containing all species of interest to our study. In our 
[preprint](https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1), we show that `haystac` makes robust species 
identifications with genus specific databases (for prokaryotes), allowing for very fast hypothesis driven analyses.

In this example, we will build a database containing all species in the *Yersinia* genus, by supplying `haystac` with a 
simple NBCI search query.
```
haystac database \
    --mode build \
    --query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
    --output yersinia_db
```

To construct an NCBI search query for your area of interest, visit the [NCBI Nucleotide database](
https://www.ncbi.nlm.nih.gov/nucleotide/) and use the search feature to obtain a correctly formatted query string from 
the "Search details" box. This search query can be used directly with `haystac` to automatically download and build 
a reference database based on the accession codes present in the resultset returned by the query.

For more exhaustive analyses, you can build a database containing the 5,681 species present in the [RefSeq 
representative](https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/) database of prokaryotic species by running:
```
haystac database \
    --mode build \
    --refseq-rep prokaryote_rep \
    --output refseq_db
```

Note: Building a database this big is not recommended on a laptop computer. 

### 2. Prepare a sample for analysis

The second step in using `haystac` is to prepare a sample for analysis.

In this example, we will download an aDNA library from [Rasmussen *et al.* (2015)](https://doi.org/10.1016/j.cell.2015.10.009), 
by giving `haystac` the SRA accession code [ERR1018966](https://www.ncbi.nlm.nih.gov/sra/?term=ERR1018966). 
Most published genomics papers include a BioProject code (e.g. [PRJEB10885](
https://www.ncbi.nlm.nih.gov/bioproject/PRJEB10885)), from which you can obtain SRA accessions for each sequencing 
library.
``` 
haystac sample \
    --sra ERR1018966 \
    --output ERR1018966
```

To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in 
`fastq` format. 

For a paired-end library, you specify the location of the `fastq` files and the name of
the output directory. You may also choose to collapse overlapping mate pairs (e.g. for an aDNA library).
```
haystac sample \
    --fastq-r1 /path/to/sample1_R1.fq.gz \
    --fastq-r2 /path/to/sample1_R2.fq.gz \
    --collapse True \
    --output sample1
```
By default, `haystac` will scan the supplied library, identify adapter sequences, and automatically remove them.

### 3. Analyse a sample against a database

The third step in using `haystac` is to perform an analysis of a sample against a database.

Here, we will use `haystac` to calculate the mean posterior abundance of all species in the *Yersinia* genus found within
the sample `ERR1018966`.
```
haystac analyse \
    --mode abundances \
    --database yersinia_db\
    --sample ERR1018966 \
    --output yersinia_ERR1018966
```

When the analysis is complete, there will be several new sub-folders in the output directory `yersinia_ERR1018966/`. To 
determine if sample `ERR1018966` contains *Yersinia pestis* (i.e. the plague) we can consult the spreadsheet containing
the mean posterior abundance estimates for all species in the *Yersinia* database (i.e., 
`yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv`). From this, we can see that 3,266 
reads were uniquely assigned to *Yersinia pestis*, with an overall abundance of 0.047%, and that the chi-squared test 
indicates that the reads are spread evenly across the genome.

Before we can confidently conclude that `ERR1018966` contains ancient *Yersinia pestis*, we may want to perform a 
damage pattern analysis.
```
haystac analyse \
    --mode abundances \
    --database yersinia_db\
    --sample ERR1018966 \
    --output yersinia_ERR1018966 \
    --mapdamage True
```


## User documentation

`haystac` has many features and potential uses, and we encourage you to use module help menus (e.g. `haystac database --help`) 
to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/


## Reporting errors

`haystac` is under active development and we encourage you to report any issues you encounter via the [GitHub issue 
tracker](https://github.com/antonisdim/haystac/issues).  

 
## Citation

A preprint describing `haystac` is available on *bioRxiv*:
 
> Dimopoulos, E.A.\*, Carmagnini, A.\*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020. 
> HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. 
> *bioRxiv* 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1

## License 
MIT


