Metadata-Version: 2.1
Name: helen
Version: 0.0.2
Summary: RNN based assembly HELEN. It works paired with MarginPolish.
Home-page: https://github.com/kishwarshafin/helen
Author: Kishwar Shafin
Author-email: kishwar.shafin@gmail.com
License: UNKNOWN
Description: # H.E.L.E.N.
        H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)
        
        
        [![Build Status](https://travis-ci.com/kishwarshafin/helen.svg?branch=master)](https://travis-ci.com/kishwarshafin/helen)
        ___________________________________________________________
        Pre-print of a paper describing the methods and overview of a suggested `de novo assembly` pipeline is now available:
        #### [Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit](https://www.biorxiv.org/content/10.1101/715722v1)
        __________________________________________________________
        
        ## Overview
        `HELEN` is a polisher intended for human-genome assemblies. `HELEN` operates on the pileup summary generated by [MarginPolish](https://github.com/UCSC-nanopore-cgl/marginPolish). `MarginPolish` uses a probabilistic graphical model to encode read alignments through a draft assembly to find the maximum-likelihood consensus sequence. The graphical model operates in run-length space, which helps to reduce errors in homopolymeric regions. `MarginPolish` can produce tensor-like summaries encapsulating the internal likelihood weights. The weights are assigned to each genomic position over multiple likely outcomes, which makes them suitable for inference by a Deep Neural Network model.
        
        `HELEN` uses a Recurrent-Neural-Network (RNN) based Multi-Task Learning (MTL) model that can predict a base and a run-length for each genomic position using the weights generated by `MarginPolish`.
        
        © 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten. <br/>
        Computational Genomics Lab (CGL), University of California, Santa Cruz.
        
        ## Why MarginPolish-HELEN ?
        * `MarginPolish-HELEN` outperforms other graph-based and Neural-Network based polishing pipelines.
        * Easily usable via Docker for both `GPU` and `CPU`.
        * Highly optimized pipeline that is faster than any other available polishing tool (~4 hours for `HELEN`).
        * We have <b>sequenced-assembled-polished 11 samples</b> to ensure robustness, runtime-consistency and cost-efficiency.
        * We tested GPU usage on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.
        * Open source [(MIT License)](LICENSE).
        
        ## Walkthrough
        A `demo` walkthrough is available here: [demo](docs/walkthrough.md)
        
        ## Table of contents
        * [Workflow](#workflow)
        * [Installation](#Installation)
        * [Usage](#Usage)
        * [Models](#Models)
           * [Released Models](#Released-Models)
        * [Runtime and Cost](#Runtime-and-Cost)
        * [Results](#Results)
        * [Eleven high-quality assemblies](#Eleven-high-quality-assemblies)
        * [Help](#Help)
        * [Acknowledgement](#Acknowledgement)
        
        ## Workflow
        
        The workflow is as follows:
        * Generate an assembly with [Shasta](https://github.com/chanzuckerberg/shasta).
        * Create a mapping between reads and the assembly using [Minimap2](https://github.com/lh3/minimap2).
        * Use [MarginPolish](https://github.com/UCSC-nanopore-cgl/marginPolish) to generate the images.
        * Use HELEN to generate a polished consensus sequence.
        <p align="center">
        <img src="img/pipeline.svg" alt="pipeline.svg" height="640p">
        </p>
        
        ## Installation
        We have docker support for both `MarginPolish` and `HELEN`. Users can install `MarginPolish` and `HELEN` on <b>`Ubuntu 18.04`</b> or any other Linux-based system by following the instructions from our [Installation Guide](docs/installation.md).
        
        If you have installed `MarginPolish-HELEN` locally, please follow the [Local Install Usage Guide](docs/usage_local_install.md).
        
        ## Usage
        `MarginPolish` requires a draft assembly and a mapping of reads to the draft assembly. We recommend using `Shasta` as the initial assembler and `MiniMap2` for the mapping.
        
        #### Step 1: Generate an initial assembly
        Although any assembler can be used to generate the initial assembly, we highly recommend using [Shasta](https://github.com/chanzuckerberg/shasta).
        
        Please see the [quick start documentation](https://chanzuckerberg.github.io/shasta/QuickStart.html) to learn how to use Shasta. Shasta is memory-intensive.
        > For a human size assembly, AWS instance type x1.32xlarge is recommended. It is usually available at a cost around $4/hour on the AWS spot market and should complete the human size assembly in a few hours, at coverage around 60x.
        
        An assembly can be generated by running:
        ```bash
        # you may need to convert the fastq to a fasta file
        ./shasta-Linux-0.1.0 --input <reads.fa> --output <path_to_shasta_output>
        ```
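        
        If your reads are only available as FASTQ, the conversion mentioned in the comment above can be sketched with `awk`. This is a minimal sketch, not part of Shasta or HELEN: `fastq_to_fasta` is a hypothetical helper name, and it assumes every record spans exactly four lines (no wrapped sequences).
        ```bash
        # Hypothetical helper: convert 4-line FASTQ records on stdin to FASTA on stdout.
        # Assumption: each record is exactly four lines (header, sequence, '+', qualities).
        fastq_to_fasta() {
          # Line 1 of each record: swap the leading '@' for '>'. Line 2: the sequence itself.
          awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }'
        }
        ```
        For example, `fastq_to_fasta < reads.fq > reads.fa` would write a FASTA copy of the reads.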
        
        #### Step 2: Create an alignment between reads and shasta assembly
        We recommend using `MiniMap2` to generate the mapping between the reads and the assembly.
        ```bash
        # we recommend using FASTQ as marginPolish uses quality values
        # This command can run MiniMap2 with 32 threads, you can change the number as you like.
        minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
        samtools index -@32 reads_2_assembly.bam
        
        #  the -F 0x104 flag removes unaligned and secondary sequences
        ```
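        
        The `0x104` mask combines two SAM flag bits: `0x4` (read unmapped) and `0x100` (secondary alignment). The sketch below only illustrates that bit test; `would_be_removed` is a hypothetical name and is not a `samtools` command.
        ```bash
        # Hypothetical helper: succeeds if a SAM FLAG value would be dropped by `-F 0x104`.
        # 0x104 = 0x4 (read unmapped) | 0x100 (secondary alignment)
        would_be_removed() {
          [ $(( $1 & 0x104 )) -ne 0 ]
        }
        ```
        A primary forward alignment (flag `0`) or a primary reverse alignment (flag `16`) is kept, while an unmapped read (flag `4`) or a secondary alignment (flag `256`) is removed.
        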
        #### Step 3: Generate images using MarginPolish
        ##### Run MarginPolish using docker
        `MarginPolish` can be used in a docker container. You can get the image from:
        ```bash
        docker pull kishwars/margin_polish:latest
        docker run kishwars/margin_polish:latest --help
        ```
        
        To generate images with the `MarginPolish` docker image, first collect all your input data (`shasta_assembly.fa, reads_2_assembly.bam, allParams.np.human.guppy-ff-235.json`) in one directory, e.g. `</your/data/dir>`.
        Then please run:
        ```bash
        docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/margin_polish:latest reads_2_assembly.bam \
        shasta_assembly.fa \
        /opt/MarginPolish/params/<model_name.json> \
        -t <number_of_threads> \
        -o output/marginpolish_images \
        -f
        ```
        
        You can get the `params.json` from `path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json`.
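        
        Before starting the container, it can help to confirm that the input files are actually in the directory you are about to mount. The following is a minimal sketch; `check_inputs` is a hypothetical helper and not part of `MarginPolish`.
        ```bash
        # Hypothetical helper: check that each named file exists inside a directory.
        # Prints every missing file to stderr and returns non-zero if any is absent.
        check_inputs() {
          dir=$1
          shift
          missing=0
          for name in "$@"; do
            if [ ! -f "$dir/$name" ]; then
              echo "missing: $dir/$name" >&2
              missing=1
            fi
          done
          return "$missing"
        }
        ```
        For example: `check_inputs </your/data/dir> shasta_assembly.fa reads_2_assembly.bam`.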
        
        #### Step 4: Run HELEN
        
        ##### Download Model
        Before running `call_consensus.py` please download the model appropriate for your data. Please read our [model guideline](#Models) to understand which model to pick.
        
        ##### Get docker images (GPU)
        Please install `CUDA 10.0` to run the GPU supported docker for `HELEN`.
        ```bash
        sudo apt-get install nvidia-docker2
        sudo docker pull kishwars/helen:0.0.1.gpu
        sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h
        ```
        
        ###### Run call_consensus.py
        Please gather all your data in an input directory. Then run `call_consensus.py` using the following command:
        ```bash
        sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
        -i <marginpolish_images> \
        -b <batch_size> \
        -m <r941_flip235_v001.pkl> \
        -o <output_dir/> \
        -p <output_filename_prefix> \
        -w 0 \
        -t 1 \
        -g
        
        Arguments:
          -h, --help            show this help message and exit
          -i IMAGE_FILE, --image_file IMAGE_FILE
                                [REQUIRED] Path to a directory where all MarginPolish
                                generated images are.
          -m MODEL_PATH, --model_path MODEL_PATH
                                [REQUIRED] Path to a trained model (pkl file). Please
                                see our github page to see options.
          -b BATCH_SIZE, --batch_size BATCH_SIZE
                                Batch size for testing, default is 512. Please set to
                                512 or 1024 for a balanced execution time.
          -w NUM_WORKERS, --num_workers NUM_WORKERS
                                Number of workers to assign to the dataloader. Should
                                be 0 if using Docker.
          -t THREADS, --threads THREADS
                                Number of PyTorch threads to use, default is 1. This
                                may be helpful during CPU-only inference.
          -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                                Path to the output directory.
          -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                                Prefix for the output file. Default is:
                                HELEN_prediction
          -g, --gpu_mode        If set then PyTorch will use GPUs for inference.
        ```
        ###### Run stitch.py
        Finally you can run `stitch.py` to get a consensus sequence:
        ```bash
        sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu \
        stitch.py \
        -i <output_dir/helen_predictions_XX.hdf> \
        -t <number_of_threads> \
        -o <output_dir/> \
        -p <output_prefix>
        
        Arguments:
          -i INPUT_HDF, --input_hdf INPUT_HDF
                                [REQUIRED] Path to a HDF5 file that was generated
                                using call consensus.
          -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                                [REQUIRED] Path to the output directory.
          -t THREADS, --threads THREADS
                                [REQUIRED] Number of threads.
          -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                                Prefix for the output file. Default is: HELEN_consensus
        
        ```
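        
        Once `stitch.py` has finished, a quick sanity check of the consensus can be useful. The sketch below assumes the output is a FASTA file; `fasta_stats` is a hypothetical helper, and the path you pass to it depends on your `-o` and `-p` values.
        ```bash
        # Hypothetical helper: report the number of sequences and total bases in a FASTA file.
        # Assumption: the consensus written by stitch.py is plain FASTA.
        fasta_stats() {
          awk '/^>/ { n++; next } { bases += length($0) } END { printf "%d sequences, %d bases\n", n, bases }' "$1"
        }
        ```
        A large drop in total bases relative to the draft assembly would be worth investigating before using the polished sequence.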
        
        
        ##### Get docker images (CPU) (not recommended)
        If you want to try running inference on the CPU, pull the CPU image:
        ```bash
        sudo docker pull kishwars/helen:0.0.1.cpu
        sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h
        ```
        
        ##### Run call_consensus.py (CPU)
        Please gather all your data in an input directory. Then run `call_consensus.py` using the following command:
        ```bash
        docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
        -i <marginpolish_images> \
        -b <batch_size> \
        -m <r941_flip235_v001.pkl> \
        -o <output_dir/> \
        -p <output_filename_prefix> \
        -w 0 \
        -t <number_of_threads>
        ```
        
        ##### Run stitch.py
        Finally you can run `stitch.py` to get a consensus sequence:
        ```bash
        docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu stitch.py \
        -i <output_dir/helen_predictions_XX.hdf> \
        -t <number_of_threads> \
        -o <output_dir> \
        -p <output_prefix>
        ```
        
        ## Models
        #### Released models
        Changes in the basecaller algorithm can directly affect the outcome of HELEN. We will release trained models for new basecallers as they come out.
        <center>
        
        <table>
          <tr>
            <th>Model Name</th>
            <th>Release Date</th>
            <th>Intended base-caller</th>
            <th>Link</th>
            <th>Comment</th>
          </tr>
          <tr>
            <td>r941_flip231_v001.pkl</td>
            <td>29/05/2019</td>
            <td>Guppy 2.3.1</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip231_v001.pkl">Model_link</a></td>
            <td>The model is trained on chr1-6 of CHM13 <br>with Guppy 2.3.1 base called data.</td>
          </tr>
          <tr>
            <td>r941_flip233_v001.pkl</td>
            <td>29/05/2019</td>
            <td>Guppy 2.3.3</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip233_v001.pkl">Model_link</a></td>
            <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 2.3.3 base called data.</td>
          </tr>
          <tr>
            <td>r941_flip235_v001.pkl</td>
            <td>29/05/2019</td>
            <td>Guppy 2.3.5</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip235_v001.pkl">Model_link</a></td>
            <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 2.3.5 base called data.</td>
          </tr>
          <tr>
              <td>r941_flip305_v001.pkl</td>
              <td>06/11/2019</td>
              <td>Guppy 3.0.5</td>
              <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/guppy305_trained_models/r941_flip305_helen.pkl">Model_link</a></td>
              <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 3.0.5 base called data.</td>
            </tr>
        </table>
        </center>
        
        We have seen significant differences in the homopolymer base-calls between different basecallers. It is important to pick the right model version for the best polishing results.
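        
        To fetch a model from the command line, the links in the table above can be wrapped in a small lookup. This is a sketch only: `model_url` is a hypothetical helper, and the URLs are copied from the table (note that the Guppy 3.0.5 model is stored under a different path and file name).
        ```bash
        # Hypothetical helper: print the download URL for a released model name.
        # URLs are taken from the "Released models" table.
        model_url() {
          base="https://storage.googleapis.com/kishwar-helen/helen_trained_models"
          case "$1" in
            r941_flip231_v001.pkl|r941_flip233_v001.pkl|r941_flip235_v001.pkl)
              echo "$base/v0.0.1/$1" ;;
            r941_flip305_v001.pkl)
              echo "$base/guppy305_trained_models/r941_flip305_helen.pkl" ;;
            *)
              echo "unknown model: $1" >&2
              return 1 ;;
          esac
        }
        ```
        For example, `wget "$(model_url r941_flip235_v001.pkl)"` would download the Guppy 2.3.5 model.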
        
        Confusion matrix of Guppy 2.3.1 on CHM13 chromosome X:
        <img src="img/Figure4b.png" alt="guppy235" width="1080p"> <br/>
        
        #### Model Schema
        
        HELEN implements a Recurrent-Neural-Network (RNN) based Multi-task learning model with hard parameter sharing. It implements a sliding window method where it slides through the input sequence in chunks. As each input sequence is evaluated independently, it allows HELEN to use mini-batch during training and testing.
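        
        The chunking idea can be illustrated independently of the model. The sketch below only prints window boundaries over a sequence of a given length; `window_bounds`, the window size and the step are hypothetical and do not reflect the values HELEN actually uses.
        ```bash
        # Hypothetical illustration: print "start end" for each chunk of a sliding window.
        # Arguments: sequence length, window size, step (step < window gives overlapping chunks).
        window_bounds() {
          length=$1
          window=$2
          step=$3
          start=0
          while [ "$start" -lt "$length" ]; do
            end=$(( start + window ))
            # Clamp the final chunk to the end of the sequence.
            if [ "$end" -gt "$length" ]; then
              end=$length
            fi
            echo "$start $end"
            if [ "$end" -eq "$length" ]; then
              break
            fi
            start=$(( start + step ))
          done
        }
        ```
        Because every chunk is defined by its own bounds, chunks can be grouped into mini-batches and evaluated independently, which is the property the paragraph above relies on.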
        
        <p align="center">
        <img src="img/model_schema.svg" alt="model_schema.svg" height="640p">
        </p>
        
        ## Runtime and Cost
        `MarginPolish-HELEN` ensures runtime consistency and cost efficiency. We have tested our pipeline on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.
        
        We studied several samples of 50-60x coverage and created a suggestion framework for running the polishing pipeline. Please be advised that these are cost-optimized suggestions. For better run-time performance you can use more resources.
        #### Google Cloud Platform (GCP)
        For `MarginPolish` please use n1-standard-64 (64 vCPUs, 240GB RAM) instance. <br/>
        Our estimated run-time is: 12 hours <br/>
        Estimated cost for `MarginPolish`: <b>$33</b>
        
        For `HELEN`, our suggested instance type is:
        * Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
        * GPUs: 2 x NVIDIA Tesla P100
        * Disk: 2TB SSD
        * Cost: $4.65/hour
        
        The estimated runtime with this instance type is 4 hours. <br>
        The estimated cost for `HELEN` is <b>$28</b>.
        
        Total estimated run-time for polishing: 18 hours. <br/>
        Total estimated cost for polishing: <b>$61</b>
        
        #### Amazon Web Services (AWS)
        For `MarginPolish` we recommend c5.18xlarge (72 CPU, 144GiB RAM) instance. <br/>
        Our estimated run-time is: 12 hours <br/>
        Estimated cost for `MarginPolish`: <b>$39</b>
        
        We recommend using `p2.8xlarge` instance type for `HELEN`. The configuration is as follows:
        * Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
        * GPUs: 8 x NVIDIA Tesla K80
        * Disk: 2TB SSD
        * Cost: $7.20/hour
        * Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0
        
        The estimated runtime with this instance type: 4 hours <br>
        The estimated cost for `HELEN` is: <b>$36</b>
        
        Total estimated run-time for polishing: 16 hours. <br/>
        Total estimated cost for polishing: <b>$75</b>
        
        Please see our detailed [run-time case study](docs/runtime_cost.md) documentation for better insight.
        
        We also see a significant improvement in run-time over other available polishing algorithms:
        <p align="center">
        <img src="img/Figure4d.png" alt="pipeline.svg" height="420p">
        </p>
        
        ## Results
        We compared `Medaka` and `HELEN` as polishing pipelines on a Shasta assembly with the `assess_assembly` module available from `Pomoxis`. A summary of the resulting assembly quality is shown here:
        
        <p align="center">
        <img src="img/Figure4a.png" alt="error_rate" height=420p>
        </p>
        
        We also see that `MarginPolish-HELEN` performs consistently across multiple assemblers.
        <p align="center">
        <img src="img/Figure4c.png" alt="Multiple_assembler_error_rate" height=420p>
        </p>
        
        ## Eleven high-quality assemblies
        We have sequenced-assembled-polished 11 human genome assemblies at University of California, Santa Cruz with our pipeline. They can be downloaded from our [google bucket](https://console.cloud.google.com/storage/browser/kishwar-helen/polished_genomes/london_calling_2019/).
        
        For quick links, please copy a link from this table and you can run `wget` to download the files:
        ```bash
        wget <link>
        ```
        The eleven assemblies with their download links:
        
        <table>
          <tr>
            <th>Sample name</th>
            <th>Download link</th>
          </tr>
          <tr>
            <td>HG00733</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG00733_shasta_marginpolish_helen_consensus.fa">HG00733_download_link</a></td>
          </tr>
        
          <tr>
            <td>HG01109</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG01109_shasta_marginpolish_helen_consensus.fa">HG01109_download_link</a></td>
          </tr>
          <tr>
            <td>HG01243</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG01243_shasta_marginpolish_helen_consensus.fa">HG01243_download_link</a></td>
          </tr>
          <tr>
            <td>HG02055</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02055_shasta_marginpolish_helen_consensus.fa">HG02055_download_link</a></td>
          </tr>
          <tr>
            <td>HG02080</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02080_shasta_marginpolish_helen_consensus.fa">HG02080_download_link</a></td>
          </tr>
          <tr>
            <td>HG02723</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02723_shasta_marginpolish_helen_consensus.fa">HG02723_download_link</a></td>
          </tr>
          <tr>
            <td>HG03098</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG03098_shasta_marginpolish_helen_consensus.fa">HG03098_download_link</a></td>
          </tr>
          <tr>
            <td>HG03492</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG03492_shasta_marginpolish_helen_consensus.fa">HG03492_download_link</a></td>
          </tr>
          <tr>
            <td>GM24143</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24143_shasta_marginpolish_helen_consensus.fa">GM24143_download_link</a></td>
          </tr>
          <tr>
            <td>GM24149</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24149_shasta_marginpolish_helen_consensus.fa">GM24149_download_link</a></td>
          </tr>
          <tr>
            <td>GM24385/HG002</td>
            <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24385_shasta_marginpolish_helen_consensus.fa">GM24385_download_link</a></td>
          </tr>
        </table>
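        
        All links in the table share one naming pattern, so the downloads can be scripted. The following is a sketch under that assumption; `assembly_url` is a hypothetical helper, and the loop only prints the `wget` commands (a dry run) rather than downloading anything.
        ```bash
        # Hypothetical helper: build the download URL for a sample from the table above.
        assembly_url() {
          echo "https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/${1}_shasta_marginpolish_helen_consensus.fa"
        }

        # Sample names as they appear in the file names (GM24385 is also known as HG002).
        samples="HG00733 HG01109 HG01243 HG02055 HG02080 HG02723 HG03098 HG03492 GM24143 GM24149 GM24385"

        # Dry run: print one wget command per sample; remove the echo to download.
        for s in $samples; do
          echo wget "$(assembly_url "$s")"
        done
        ```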
        
        
        We also polished the `CHM13` genome assembly available from the [Telomere-to-telomere consortium](https://github.com/nanopore-wgs-consortium/CHM13) project. <br/>
        `CHM13` polished assembly is available for download from here: <a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/CHM13_shasta_marginpolish_helen_consensus.fa">CHM13_download_link</a>
        
        ## Help
        Please open a github issue if you face any difficulties.
        
        ## Acknowledgement
        We are thankful to [Sergey Koren](https://github.com/skoren) and [Karen Miga](https://github.com/khmiga) for their help with `CHM13` data and evaluation.
        
        We downloaded our data from [Telomere-to-telomere consortium](https://github.com/nanopore-wgs-consortium/CHM13) to evaluate our pipeline against `CHM13`.
        
        We acknowledge the work of the developers of these packages: <br/>
        * [Shasta](https://github.com/chanzuckerberg/shasta/commits?author=paoloczi)
        * [pytorch](https://pytorch.org/)
        * [ssw library](https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library)
        * [hdf5 python (h5py)](https://www.h5py.org/)
        * [pybind](https://github.com/pybind/pybind11)
        * [hyperband](https://github.com/zygmuntz/hyperband)
        
        ## Fun Fact
        <img src="https://vignette.wikia.nocookie.net/marveldatabase/images/e/eb/Iron_Man_Armor_Model_45_from_Iron_Man_Vol_5_8_002.jpg/revision/latest?cb=20130420194800" alt="guppy235" width="240p"> <img src="https://vignette.wikia.nocookie.net/marveldatabase/images/c/c0/H.E.L.E.N._%28Earth-616%29_from_Iron_Man_Vol_5_19_002.jpg/revision/latest?cb=20140110025158" alt="guppy235" width="120p"> <br/>
        
        The name "HELEN" is inspired by the A.I. created by Tony Stark in the Marvel Comics (Earth-616). HELEN was created to control the city Tony was building, named "Troy", making the A.I. "HELEN of Troy".
        
        READ MORE: [HELEN](https://marvel.fandom.com/wiki/H.E.L.E.N._(Earth-616))
        
        
        
        © 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten.
        
Platform: UNKNOWN
Requires-Python: >=3.5.*
Description-Content-Type: text/markdown
