Metadata-Version: 2.1
Name: pankmer
Version: 0.11.2
Summary: Generate a PanGenome given a set of genomes
Author: Semar Petrus, Allen Mamerto, Nolan Hartwick
Author-email: Anthony Aylward <aaylward@salk.edu>
Project-URL: Homepage, https://gitlab.com/salk-tm/pankmer
Project-URL: Documentation, https://salk-tm.gitlab.io/pankmer
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: BSD License
Requires-Python: <3.11,>=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

Primary contact: Anthony Aylward, aaylward@salk.edu

# PanKmer

_k_-mer based and reference-free pangenome analysis. See the quickstart below, or read the [documentation](https://salk-tm.gitlab.io/pankmer/index.html).

## Installation
### With pip
```
pip install pankmer
```

### In a conda environment
First create an environment that includes all dependencies:
```
conda create -c bioconda -c conda-forge -n pankmer python==3.10 biopython==1.79 cython setuptools seaborn urllib3 wheel python-newick pyfaidx gff2bed
```
If running on OSX, a few additional packages will be required:
```
conda activate pankmer
conda install -c conda-forge clang_osx-64 clangxx_osx-64 gfortran_osx-64
```
Then install PanKmer with pip:
```
conda activate pankmer
pip install pankmer
```

### Check installation
Check that the installation was successful by running:
```
pankmer --version
```

## Tutorial
### Download example dataset

The `download_example` subcommand will download a small example dataset of
Chr19 sequences from _S. polyrhiza_.
```
pankmer download_example -d .
```
After running this command, the directory `PanKmer_example_Sp_Chr19/` will be present in the working directory. It contains FASTA files representing Chr19 from three genomes, and GFF files giving their gene annotations.
```
ls PanKmer_example_Sp_Chr19/*
```
```
PanKmer_example_Sp_Chr19/README.md

PanKmer_example_Sp_Chr19/Sp_Chr19_features:
Sp7498_HiC_Chr19.gff.gz Sp9509_oxford_v3_Chr19.gff3.gz Sp9512_a02_genes_Chr19.gff3.gz

PanKmer_example_Sp_Chr19/Sp_Chr19_genomes:
Sp7498_HiC_Chr19.fasta.gz Sp9509_oxford_v3_Chr19.fasta.gz Sp9512_a02_genome_Chr19.fasta.gz
```

To get started, navigate to the downloaded directory.
```
cd PanKmer_example_Sp_Chr19/
```

### Build a _k_-mer index

The _k_-mer index is a table tracking presence or absence of _k_-mers in the set of input genomes. To build an index, use the `index` subcommand and provide a directory containing the input genomes.

```
pankmer index -g Sp_Chr19_genomes/ -o Sp_Chr19_index.tar
```

After completion, the index will be present as a tar file `Sp_Chr19_index.tar`.
```
tar -tvf Sp_Chr19_index.tar
```
```
Sp_Chr19_index/
Sp_Chr19_index/kmers.b.gz
Sp_Chr19_index/metadata.json
Sp_Chr19_index/scores.b.gz
```
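
To make the idea of a presence/absence table concrete, here is a minimal Python sketch on toy sequences. It is an illustration only, not PanKmer's implementation: the function names `kmers` and `build_index`, the value `k=4`, and the bitmask encoding of genome membership are all assumptions chosen for the example, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Map each k-mer to a bitmask; bit i is set if genome i contains it.

    `genomes` is a dict of name -> sequence (toy data, hypothetical layout).
    """
    index = {}
    for i, seq in enumerate(genomes.values()):
        for kmer in kmers(seq, k):
            index[kmer] = index.get(kmer, 0) | (1 << i)
    return index


# Toy "genomes": short strings standing in for FASTA records
toy_genomes = {"A": "ACGTACGA", "B": "ACGTTCGA", "C": "TTCGAACG"}
toy_index = build_index(toy_genomes, k=4)

for kmer, mask in sorted(toy_index.items()):
    # Show the mask as one presence/absence flag per genome (A, B, C order)
    flags = [(mask >> i) & 1 for i in range(len(toy_genomes))]
    print(kmer, flags)
```

Each printed row corresponds to one _k_-mer, and each flag records whether that _k_-mer occurs in the matching toy genome.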

> #### Note
> The input genomes argument provided with the `-g` flag can be a directory, a tar archive, or a comma-separated list of FASTA files.
>
> If the output argument provided with the `-o` flag ends with `.tar`, then the index will be written as a tar archive. Otherwise it will be written as a directory.


### Create an adjacency matrix

A useful application of the _k_-mer index is to generate an adjacency matrix. This is a table of _k_-mer similarity values for each pair of genomes in the index. We can generate one using the `adj-matrix` subcommand, which will produce a CSV file containing the matrix.

```
pankmer adj-matrix -i Sp_Chr19_index.tar -o Sp_Chr19_adj_matrix.csv
```
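
As a rough illustration of what such a matrix holds, the following Python sketch builds a pairwise table from toy _k_-mer sets and prints it as CSV. It assumes each entry is the number of _k_-mers shared by a pair of genomes, with genome names in the header row and first column. The data, the function name `shared_kmer_matrix`, and the CSV layout are hypothetical, so the file PanKmer writes may differ.

```python
import csv
import io

# Toy k-mer sets per genome (hypothetical data, not PanKmer output)
kmer_sets = {
    "A": {"ACGT", "CGTA", "GTAC"},
    "B": {"ACGT", "CGTT", "GTAC", "TTCG"},
    "C": {"TTCG", "TCGA"},
}


def shared_kmer_matrix(sets):
    """Return (names, rows); rows[i][j] counts k-mers shared by genomes i and j."""
    names = list(sets)
    rows = [[len(sets[a] & sets[b]) for b in names] for a in names]
    return names, rows


names, rows = shared_kmer_matrix(kmer_sets)

# Write the matrix as CSV with genome names as header row and first column
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([""] + names)
for name, row in zip(names, rows):
    writer.writerow([name] + row)
print(buffer.getvalue())
```

Under this assumption, the diagonal holds the number of _k_-mers in each genome and the matrix is symmetric.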

> #### Note
> The input index argument provided with the `-i` flag can be a tar archive or a directory.

### Plot a clustered heatmap

To visualize the adjacency matrix, we can plot a clustered heatmap of the adjacency values. In this case we use the Jaccard similarity metric for pairwise comparisons between genomes:

```
pankmer clustermap -i Sp_Chr19_adj_matrix.csv \
  -o Sp_Chr19_adj_matrix.svg \
  --metric jaccard \
  --width 6.5 \
  --height 6.5
```
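
The Jaccard similarity of two _k_-mer sets is the size of their intersection divided by the size of their union. The sketch below applies that definition to a toy matrix of shared _k_-mer counts. It assumes the diagonal holds each genome's own _k_-mer count; the function name `jaccard_from_counts` and the numbers are hypothetical, and this is not necessarily how `clustermap` computes the metric internally.

```python
def jaccard_from_counts(counts):
    """Convert a shared-count matrix to Jaccard similarities.

    Assumes counts[i][i] is the number of k-mers in genome i and
    counts[i][j] is the number shared by genomes i and j.
    """
    n = len(counts)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # |A union B| = |A| + |B| - |A intersect B|
            union = counts[i][i] + counts[j][j] - counts[i][j]
            result[i][j] = counts[i][j] / union if union else 0.0
    return result


# Toy shared-count matrix for three genomes (hypothetical values)
toy_counts = [[3, 2, 0], [2, 4, 1], [0, 1, 2]]
for row in jaccard_from_counts(toy_counts):
    print([round(value, 2) for value in row])
```

Values range from 0 (no shared _k_-mers) to 1 (identical _k_-mer sets), so each genome has a similarity of 1 with itself.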

![example heatmap](docs/source/_static/Sp_Chr19_adj_matrix.svg)

### Generate a gene variability heatmap

Generate a heatmap showing variability of genes across genomes. The following command uses the `--n-features` option to limit analysis to the first two genes from each input GFF file. The resulting image shows the level of variability observed across genes from each genome.

```
pankmer reg_heatmap -i Sp_Chr19_index.tar \
  -r Sp_Chr19_genomes/Sp7498_HiC_Chr19.fasta.gz Sp_Chr19_genomes/Sp9509_oxford_v3_Chr19.fasta.gz Sp_Chr19_genomes/Sp9512_a02_genome_Chr19.fasta.gz \
  -f Sp_Chr19_features/Sp7498_HiC_Chr19.gff.gz Sp_Chr19_features/Sp9509_oxford_v3_Chr19.gff3.gz Sp_Chr19_features/Sp9512_a02_genes_Chr19.gff3.gz \
  -o Sp_Chr19_gene_var.png \
  --n-features 2 \
  --height 3
```

![example heatmap](example/Sp_Chr19_gene_variability.png)
