# ZGA - prokaryotic genome assembly and annotation pipeline

[![version status](https://img.shields.io/pypi/v/zga.svg)](https://pypi.python.org/pypi/zga)
[![Anaconda Cloud](https://anaconda.org/laxeye/zga/badges/installer/conda.svg)](https://anaconda.org/laxeye/zga/)

## Installation

ZGA is written in Python and tested with Python 3.6 and Python 3.7. It depends on several tools and libraries, including:

* [fastp](https://github.com/OpenGene/fastp)
* [BBmap](https://sourceforge.net/projects/bbmap/)
* [NxTrim](https://github.com/sequencing/NxTrim)
* [mash](https://mash.readthedocs.io/en/latest/)
* [SPAdes](http://cab.spbu.ru/software/spades/) (>= 3.12 to support merged paired-end reads, >= 3.5.0 to support Nanopore reads)
* [Unicycler](https://github.com/rrwick/Unicycler/)
* [Flye](https://github.com/fenderglass/Flye) >= 2.6
* [racon](https://github.com/lbcb-sci/racon)
* [CheckM](https://github.com/Ecogenomics/CheckM) >= 1.1.0
* [BioPython](https://biopython.org/)
* [NCBI BLAST+](https://blast.ncbi.nlm.nih.gov/Blast.cgi)
* [DFAST](https://github.com/nigyta/dfast_core)
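
The minimum versions listed above are easy to get wrong when compared as plain strings (`"3.5.0" >= "3.12"` is true for strings). As a minimal sketch, the helper below compares versions numerically. The function names and the `MINIMUM_VERSIONS` table are illustrative and not part of ZGA; how you obtain each tool's version string is left to you.

```python
import re

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {"SPAdes": "3.12", "Flye": "2.6", "CheckM": "1.1.0"}


def parse_version(text):
    """Return the first dotted number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    # Pad the shorter tuple with zeros so that 3.12 and 3.12.0 compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For example, `meets_minimum("3.5.0", MINIMUM_VERSIONS["SPAdes"])` is false, because 5 is compared with 12 as a number.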

### Install with conda

The simplest way to install ZGA and all its dependencies is with conda:

1. You need to install conda, e.g. [**miniconda**](https://conda.io/en/latest/miniconda.html). Python 3.7 is preferred.

2. After installation, add the channels (conda's software sources):  
`conda config --add channels defaults`  
`conda config --add channels bioconda`  
`conda config --add channels conda-forge`

3. Finally, install ZGA into an existing active environment (Python 3.6 or 3.7):  
`conda install -c laxeye zga`  
or create a fresh environment and activate it:  
`conda create -n zga -c laxeye zga`  
`conda activate zga`

[![Anaconda latest release](https://anaconda.org/laxeye/zga/badges/latest_release_date.svg)](https://anaconda.org/laxeye/zga/)

### Installing dependencies

All dependencies may be installed using **conda**. Creating a new conda environment is highly recommended:

`conda create -n zga "python>=3.6" fastp "spades>=3.12" unicycler checkm-genome dfast bbmap blast biopython nxtrim "mash>=2" flye racon "samtools>=1.9"`

and activate it:

`conda activate zga`

Otherwise, you may install the dependencies into an existing conda environment:

`conda install "python>=3.6" fastp "spades>=3.12" unicycler checkm-genome dfast bbmap blast biopython nxtrim "mash>=2" flye racon "samtools>=1.9"`

Of course, you can install the tools in other ways, or even compile them all from source. In that case, check that the binaries are on your `$PATH`.
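
A quick way to check that the binaries are on `$PATH` is a short script like the sketch below. The executable names in `EXPECTED_BINARIES` are assumptions based on each tool's usual command name; ZGA itself may call different entry points, so adjust the list to your installation.

```python
import shutil

# Assumed executable names for the dependencies; not taken from ZGA's source.
EXPECTED_BINARIES = [
    "fastp", "bbmerge.sh", "nxtrim", "mash", "spades.py", "unicycler",
    "flye", "racon", "checkm", "blastn", "dfast",
]


def find_missing(binaries):
    """Return the names from `binaries` that cannot be found on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(EXPECTED_BINARIES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All expected binaries were found.")
```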

### Install from PyPI

Run `pip install zga`. Biopython is the only dependency installed from PyPI. You should install all other dependencies manually or with **conda** as described above. CheckM is available on **PyPI**, but it's easier to install with **conda**.

### Get source from GitHub

You can get ZGA by cloning the repository with `git clone https://github.com/laxeye/zga.git` or by downloading an archive.
After downloading, enter the directory and run `python3 setup.py build && python3 setup.py install`.

### Operating system requirements

ZGA was tested on Ubuntu 18.04 and 19.10. Most likely any modern 64-bit Linux distribution will work.

Feedback on other operating systems is welcome!

## Usage

Run `zga -h` to get a help message.

Examples:

Perform all steps: read QC, read trimming and merging, assembly, CheckM assessment with the default (bacterial) marker set, and DFAST annotation, using 4 CPU threads where possible:

`zga -1 R1.fastq.gz -2 R2.fastq.gz --threads 4 -o my_assembly`
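
If you drive ZGA from a script, the same call can be assembled as an argument list. The sketch below uses only the options shown in the command above. `build_zga_command` and `run_zga` are hypothetical helper names, and `run_zga` defaults to a dry run that only prints the command.

```python
import shlex
import subprocess


def build_zga_command(r1, r2, outdir, threads=4):
    """Build the argument list for a paired-end ZGA run."""
    return [
        "zga", "-1", str(r1), "-2", str(r2),
        "--threads", str(threads), "-o", str(outdir),
    ]


def run_zga(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    printable = " ".join(shlex.quote(arg) for arg in command)
    print(printable)
    if not dry_run:
        # check=True raises CalledProcessError if zga exits with an error.
        subprocess.run(command, check=True)
    return printable
```

Calling `run_zga(build_zga_command("R1.fastq.gz", "R2.fastq.gz", "my_assembly"))` prints the command without starting the pipeline. Pass `dry_run=False` to execute it.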

Assemble an archaeal genome with SPAdes using paired-end and Nanopore reads (CheckM will use archaeal markers), setting the memory limit to 16 GB:

`zga -1 R1.fastq.gz -2 R2.fastq.gz --nanopore MiniION.fastq.gz -a spades --threads 4 --memory-limit 16 --domain archaea -o my_assembly`

Assemble long reads with Flye, skipping long-read polishing, and perform short-read polishing with racon:

`zga -1 R1.fastq.gz -2 R2.fastq.gz --nanopore MiniION.fastq.gz -a flye --threads 4 --domain archaea -o my_assembly --flye-short-polish --skip-flye-long-polish`

Assemble from Nanopore reads using Unicycler:

`zga -a unicycler --nanopore MiniION.fastq -o nanopore_assembly`

Perform assessment and annotation of a genome assembly with the 'Pectobacterium' CheckM marker set:

`zga --first-step check_genome -g pectobacterium_sp.fasta --checkm_rank genus --checkm_taxon Pectobacterium -o my_output_dir`

Let CheckM infer the right marker set:

`zga --first-step check_genome -g my_genome.fa --checkm_mode lineage -o my_output_dir`
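
To assess several existing assemblies in one go, the command above can be generated once per genome file. This is a sketch only: `check_genome_command` is a hypothetical helper, and writing each result to `<out_root>/<genome file stem>` is an assumed layout, not a ZGA requirement.

```python
from pathlib import Path


def check_genome_command(genome, out_root):
    """Build a ZGA call that checks and annotates one assembly.

    The output directory is <out_root>/<genome file stem> (assumed layout).
    """
    genome = Path(genome)
    outdir = Path(out_root) / genome.stem
    return [
        "zga", "--first-step", "check_genome",
        "-g", str(genome),
        "--checkm_mode", "lineage",
        "-o", str(outdir),
    ]


if __name__ == "__main__":
    # Print one command per FASTA file in a hypothetical "genomes" directory.
    for fasta in sorted(Path("genomes").glob("*.fa")):
        print(" ".join(check_genome_command(fasta, "results")))
```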

## Known issues and limitations

ZGA is under active development.

Known issues and limitations:

* It's not possible to provide multiple read libraries, i.e. two sets of paired-end reads or two Nanopore runs.
* Unicycler doesn't use mate-pair reads.
* It's not possible to install all dependencies with Python 3.8 via conda, please use 3.7 or 3.6.

Don't hesitate to report bugs or request features!

## Cite

It's a great pleasure to know that the software is useful. Please cite ZGA:

Korzhenkov A. (2020). ZGA: prokaryotic genome assembly and annotation pipeline.

And, of course, the tools it uses:

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890. https://doi.org/10.1093/bioinformatics/bty560

Bushnell, B., Rood, J., & Singer, E. (2017). BBMerge – accurate paired shotgun read merging via overlap. PLoS ONE, 12(10).

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pyshkin, A. V. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477.

Wick, R. R., Judd, L. M., Gorrie, C. L., & Holt, K. E. (2017). Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology, 13(6), e1005595.

Vaser, R., Sović, I., Nagarajan, N., & Šikić, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome research, 27(5), 737-746.

Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P. A. (2019). Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5), 540-546.

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome research, 25(7), 1043-1055.

Tanizawa, Y., Fujisawa, T., & Nakamura, Y. (2018). DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics, 34(6), 1037-1039.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

O’Connell, J., et al. (2015). NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics, 31(12), 2035-2037.

Ondov, B. D., Treangen, T. J., Melsted, P., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17, 132. https://doi.org/10.1186/s13059-016-0997-x
