Metadata-Version: 2.1
Name: medicc2
Version: 0.6b0
Summary: Whole-genome doubling-aware copy number phylogenies for cancer evolution
Home-page: https://bitbucket.org/schwarzlab/medicc2
Author: Tom L Kaufmann, Marina Petkovic, Roland F Schwarz
Author-email: tkau93@gmail.com, marina.55kovic@gmail.com, roland.f.schwarz@gmail.com
License: GPL-3
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE

# MEDICC2 - Whole-genome doubling-aware copy number phylogenies for cancer evolution

For more information, see the accompanying paper [Whole-genome doubling-aware copy number phylogenies for cancer evolution with MEDICC2](https://www.biorxiv.org/content/10.1101/2021.02.28.433227v2).

# Installation
Install MEDICC2 via conda (recommended), pip or from source. MEDICC2 was developed and tested on Unix-based systems (Linux and macOS). For Windows users we recommend WSL2.

Note that the notebooks and examples are not included when installing from conda or pip. When installing from pip or source, make sure that working versions of `gcc` and `gxx` are installed.

## Installation via conda (recommended)
MEDICC2 can be installed via `conda install -c bioconda -c conda-forge medicc2`.

## Installation via pip
MEDICC2 relies on OpenFST version 1.8.1, which is not packaged on PyPI, so you first have to install it using conda with `conda install -c conda-forge openfst`. Next, you can install MEDICC2 via `pip install medicc2`.

## Installation from source
Clone the MEDICC2 repository and its submodules using `git clone --recursive https://bitbucket.org/schwarzlab/medicc2.git`. It is important to use the `--recursive` flag to also download the modified OpenFST submodule.

All dependencies including OpenFST (v1.8.1) should be directly installable via conda. A yaml file with a suggested MEDICC2 conda environment is provided in 'doc/medicc2.yml'. You can create a new conda environment with all requirements using `conda env create -f doc/medicc2.yml -n medicc_env`.

Then, inside the `medicc2` folder, run `pip install .` to install MEDICC2 to your environment. 

# Usage
After installing MEDICC2, you can use MEDICC2 functions in python scripts (through `import medicc`) and from the command line. General usage from the command line is `medicc2 path/to/input/file path/to/output/folder`. Run `medicc2 --help` for information on optional arguments.

Logging settings can be changed using the `medicc/logging_conf.yaml` file with the standard python logging syntax.

## Command line Flags

* `input_file`: path to the input file
* `output_dir`: path to the output folder
* `--input-type`, `-i`: Choose the type of input: f for FASTA, t for TSV. Default: 'TSV'
* `--input-allele-columns`, `-a`: Name of the CN columns (comma separated) if using TSV input format. This also adjusts the number of alleles considered (min. 1, max. 2). Default: 'cn_a, cn_b'
* `--input-chr-separator`: Character used to separate chromosomes in the input data (condensed FASTA only). Default: 'X'
* `--tree`: Do not reconstruct tree, use provided tree instead (in newick format) and only perform ancestral reconstruction. Default: None
* `--topology-only`, `-s`: Output only tree topology, without reconstructing ancestors. Default: False
* `--normal-name`, `-n`: ID of the sample to be treated as the normal sample. Trees are rooted at this sample for ancestral reconstruction. If the sample ID is not found, an artificial normal sample of the same name is created with CN states = 1 for each allele. Default: 'diploid'
* `--exclude-samples`, `-x`: Comma separated list of sample IDs to exclude. Default: None
* `--filter-segment-length`: Removes segments that are smaller than specified length. Default: None
* `--bootstrap-method`: Bootstrap method. Has to be either 'chr-wise' or 'segment-wise'. Default: 'chr-wise'
* `--bootstrap-nr`: Number of bootstrap runs to perform. Default: None
* `--prefix`, `-p`: Output prefix to be used. None uses input filename. Default: None
* `--no-wgd`: Disable whole-genome doubling events. Default: False
* `--no-plot`: Disable plotting. Default: False
* `--total-copy-numbers`: Run for total copy number data instead of allele-specific data. Default: False
* `-j`, `--n-cores`: Number of cores to run on. Default: None
* `--chromosomes-bed`: BED file for chromosome regions to compare copy-number events to
* `--regions-bed`: BED file for regions of interest to compare copy-number events to
* `-v`, `--verbose`: Enable verbose output. Default: False
* `-vv`, `--debug`: Enable more verbose output. Default: False
* `--maxcn`: Expert option: maximum CN at which the input is capped. Does not change FST. Default: 8
* `--prune-weight`: Expert option: Prune weight in ancestor reconstruction. Values >0 might result in more accurate ancestors but will require more time and memory. Default: 0
* `--fst`: Expert option: path to an alternative FST. Default: None
* `--fst-chr-separator`: Expert option: character used to separate chromosomes in the FST. Default: 'X'
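
When MEDICC2 is run from a pipeline, it can help to assemble the command line programmatically. The following is a minimal sketch that only uses flags listed above; the helper `build_medicc2_command` is hypothetical and not part of MEDICC2.

```python
import shlex


def build_medicc2_command(input_file, output_dir, normal_name=None,
                          no_wgd=False, n_cores=None):
    """Assemble a medicc2 argument list (hypothetical helper, not part of MEDICC2)."""
    cmd = ["medicc2", str(input_file), str(output_dir)]
    if normal_name is not None:
        cmd += ["--normal-name", normal_name]
    if no_wgd:
        cmd.append("--no-wgd")
    if n_cores is not None:
        cmd += ["--n-cores", str(n_cores)]
    return cmd


cmd = build_medicc2_command(
    "examples/simple_example/simple_example.tsv",
    "path/to/output/folder",
    normal_name="diploid",
    n_cores=4,
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once MEDICC2 is installed.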


## Input files
Input files can be either in fasta or tsv format:
* **fasta:** A description file should be provided to MEDICC. This file should include one line per file with the name of the chromosome and the corresponding file names. If fasta files are provided you have to use the flag `--input-type fasta`.
* **tsv:** Files should have the following columns: `sample_id`, `chrom`, `start`, `end` as well as columns for the copy numbers. MEDICC expects the copy number columns to be called `cn_a` and `cn_b`. Using the flag `--input-allele-columns` you can set your own copy number columns. If you want to use total copy numbers, make sure to use the flag `--total-copy-numbers`. Important: MEDICC2 does not create total copy numbers for you. You will have to calculate total copy numbers yourself and then specify the column using the `--input-allele-columns` flag.
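
Since total copy numbers have to be computed by the user, the sketch below shows one way to derive them from allele-specific columns using only the standard library. The input rows are made up for illustration, and the column name `cn_total` is an arbitrary choice, not a name MEDICC2 expects.

```python
import csv
import io

# Made-up allele-specific input; real data would be read from a file.
allele_tsv = (
    "sample_id\tchrom\tstart\tend\tcn_a\tcn_b\n"
    "sample_1\tchr1\t0\t1000\t1\t1\n"
    "sample_1\tchr1\t1000\t2500\t2\t0\n"
    "sample_2\tchr1\t0\t1000\t2\t1\n"
    "sample_2\tchr1\t1000\t2500\t2\t2\n"
)


def add_total_copy_number(tsv_text, total_column="cn_total"):
    """Return a TSV string in which cn_a and cn_b are replaced by their sum."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = ["sample_id", "chrom", "start", "end", total_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "sample_id": row["sample_id"],
            "chrom": row["chrom"],
            "start": row["start"],
            "end": row["end"],
            total_column: int(row["cn_a"]) + int(row["cn_b"]),
        })
    return out.getvalue()


total_tsv = add_total_copy_number(allele_tsv)
print(total_tsv)
```

After writing the result to a file, the new column would be selected with `--input-allele-columns cn_total` together with `--total-copy-numbers`.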

MEDICC2 follows the BED convention for segment coordinates, i.e. segment start coordinates are 0-based and segment end coordinates are non-inclusive.
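
As a small illustration of the BED convention, the sketch below computes segment lengths and converts a segment to 1-based inclusive coordinates. Both helper functions and the example coordinates are hypothetical and only meant to show the arithmetic.

```python
def segment_length(start, end):
    """Length of a 0-based, half-open segment [start, end)."""
    if end <= start:
        raise ValueError("segment end must be greater than segment start")
    return end - start


def to_one_based_inclusive(start, end):
    """Convert a BED-style segment to 1-based, inclusive coordinates."""
    return start + 1, end


# Two adjacent segments: the end of the first equals the start of the second.
segments = [(0, 1000), (1000, 2500)]
lengths = [segment_length(start, end) for start, end in segments]
print(lengths)
print(to_one_based_inclusive(*segments[0]))
```

With half-open coordinates, adjacent segments share a boundary value without overlapping, and the length is simply `end - start`.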

The folder `examples/simple_example` contains a simple example input both in fasta and tsv format.
The folder `examples/OV03-04` contains a larger example consisting of multiple fasta files. If you want to run MEDICC on this data run `medicc2 examples/OV03-04/OV03-04_descr.txt path/to/output/folder --input-type fasta`.


## Output files
MEDICC creates the following output files:
* `_final_tree.new`, `_final_tree.xml`, `_final_tree.png`: The final phylogenetic tree in Newick and XML format as well as an image
* `_pairwise_distances.tsv`: A NxN matrix (N being the number of samples) of pairwise distances calculated with the symmetric MEDICC2 distance
* `_final_cn_profiles.tsv`: Copy-number profiles of the input samples as well as the newly reconstructed internal nodes. Also includes additional information such as whether a gain or loss has happened
* `_copynumber_events_df.tsv`: List of all copy-number events detected 
* `_cn_profiles.pdf`: Combined plot of the phylogenetic tree as well as the copy-number profiles of all samples (including the internal nodes)
* `_events_overlap.tsv`: Overlap of copy-number events with regions of interest (see below)


## Output plots
The file `_cn_profiles.pdf` contains most of the information of the MEDICC2 output. The left part consists of the inferred phylogenetic tree including the number of events in the branches. The right part is made up of the copy-number profiles of the samples as well as the reconstructed ancestral nodes. Copy-number events are also marked in the respective copy-number profiles where they appear.

### Example
Example for patient PTX011 from Gundem et al., Nature 2015. The data can be found in `example/gundem_et_al_2015/`.

![copy-number plot for PTX011 Gundem 2015](doc/MEDICC2_cn_plot_example.png)


### Legend

![legend of copy-number plot](doc/MEDICC2_cn_plot_legend.png)

## Usage examples
For first-time users we recommend looking at `examples/simple_example` to get an idea of what input data should look like. Then run `medicc2 examples/simple_example/simple_example.tsv path/to/output/folder` as an example of a standard MEDICC run. Finally, the notebook `notebooks/example_workflows.py` shows how the individual functions in the workflow are used.

The notebook `notebooks/bootstrap_demo.py` demonstrates how to use the bootstrapping routine and `notebooks/plot_demo.py` shows how to use the main plotting functions.


## Regions of interest
MEDICC2 compares the detected copy-number events to regions of interest. These regions are chromosome-boundaries and known oncogenes and tumor-suppressor genes. By default MEDICC2 uses hg38 chromosome-arms and a list of genes taken from Davoli et al. Cell 2013. This data is present as BED files in the `medicc/objects` folder.

Users can specify regions of interest of their own in BED format by providing the `--chromosomes-bed` or `--regions-bed` flags.
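
To give an idea of what comparing events to regions means, here is a minimal sketch of an overlap check between half-open intervals. It is not MEDICC2's implementation; the function names, event names, region names and coordinates are all made up for illustration.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of positions shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def events_overlapping_regions(events, regions):
    """Return (event name, region name, overlap) for every overlapping pair."""
    hits = []
    for event in events:
        for region in regions:
            if event["chrom"] != region["chrom"]:
                continue
            shared = overlap_length(event["start"], event["end"],
                                    region["start"], region["end"])
            if shared > 0:
                hits.append((event["name"], region["name"], shared))
    return hits


events = [
    {"name": "gain_1", "chrom": "chr1", "start": 100, "end": 500},
    {"name": "loss_1", "chrom": "chr2", "start": 0, "end": 50},
]
regions = [
    {"name": "region_1", "chrom": "chr1", "start": 400, "end": 900},
    {"name": "region_2", "chrom": "chr1", "start": 500, "end": 600},
    {"name": "region_3", "chrom": "chr2", "start": 10, "end": 20},
]
hits = events_overlapping_regions(events, regions)
print(hits)
```

Note that `gain_1` and `region_2` only touch at position 500, which does not count as an overlap under the half-open convention.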


# Issues
If you experience problems with MEDICC2, please [file an issue directly on Bitbucket](https://bitbucket.org/schwarzlab/medicc2/issues/new) or [contact us directly](mailto:tom.kaufmann@mdc-berlin.de).

## Known Issues

**Noisy segments**
Small faulty or noisy segments can have a strong effect on the distances MEDICC2 calculates between samples and therefore the resulting tree.
This is because MEDICC2 counts all segments equally in order to appropriately take focal events into account.
If the resulting tree and the inferred events look strange to you, you can replot the tree and copy-number profiles using the function `plot_cn_profiles` with `ignore_segment_lengths=True` (see the notebook `notebooks/plot_demo.py` for usage examples) in order to investigate small segments that might not have been visible in the original plot.
If you are unsure about the copy-number profiles, we recommend filtering small segments.
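
Filtering by length can be done with the `--filter-segment-length` flag or before running MEDICC2. The sketch below illustrates the idea on a made-up list of segments; the function `filter_short_segments` is hypothetical and not the routine MEDICC2 uses internally.

```python
def filter_short_segments(segments, min_length):
    """Keep segments whose half-open length (end - start) is at least min_length."""
    return [seg for seg in segments if seg["end"] - seg["start"] >= min_length]


# Made-up segment boundaries; the middle segment is only 500 bp long.
segments = [
    {"chrom": "chr1", "start": 0, "end": 1000000},
    {"chrom": "chr1", "start": 1000000, "end": 1000500},
    {"chrom": "chr1", "start": 1000500, "end": 3000000},
]
kept_segments = filter_short_segments(segments, min_length=10000)
print([(seg["start"], seg["end"]) for seg in kept_segments])
```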

**Taxon imbalance**
If your data contains hundreds to thousands of samples with a few distinct subgroups, an imbalance in the number of samples per subgroup might lead to an incorrect tree (e.g. 50 samples of subclone A and 1000 samples each of subclones B and C).
This is a known problem in phylogenetics called *taxon imbalance* or *taxon sampling*. If you have multiple, clearly separable subgroups in your data, we recommend either subsampling over-represented groups or upsampling under-represented groups to gauge the effect of taxon imbalance.
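
Subsampling over-represented groups can be sketched as follows, assuming the subgroup of every sample is already known. The function `subsample_groups`, the sample IDs and the group labels are hypothetical; the group sizes mirror the example above.

```python
import random
from collections import defaultdict


def subsample_groups(sample_to_group, max_per_group, seed=0):
    """Randomly keep at most max_per_group samples from every group."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for sample_id, group in sample_to_group.items():
        groups[group].append(sample_id)
    kept = []
    for group in sorted(groups):
        members = sorted(groups[group])
        if len(members) > max_per_group:
            members = rng.sample(members, max_per_group)
        kept.extend(members)
    return sorted(kept)


# 50 samples of subclone A and 1000 samples each of subclones B and C.
sample_to_group = {f"A_{i}": "A" for i in range(50)}
sample_to_group.update({f"B_{i}": "B" for i in range(1000)})
sample_to_group.update({f"C_{i}": "C" for i in range(1000)})

kept_samples = subsample_groups(sample_to_group, max_per_group=50)
excluded_samples = sorted(set(sample_to_group) - set(kept_samples))
print(len(kept_samples), len(excluded_samples))
```

The excluded IDs could then be removed from the input file or passed as a comma-separated list to `--exclude-samples`. Repeating the run with different seeds gives an impression of how stable the tree is.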

**Running out of memory / bad_alloc error**
If MEDICC2 terminates with the error `terminate called after throwing an instance of 'std::bad_alloc'` or your machine runs out of memory, this hints at an issue with the FST.
Rerun MEDICC2 with the `-vv` flag to enable extended logging. If the error occurs during the ancestral reconstruction routine, the issue is related to OpenFST, which is the FST library employed by MEDICC2, and cannot be easily solved by us.
This issue can be related to small bin sizes (and therefore a large number of segments). Increasing the bin size (at the cost of accuracy) solves this issue most of the time.
You can also try to remove the sample that led to the error (see the extended logs for this). 

# Contact
Email questions, feature requests and bug reports to **Tom Kaufmann, tom.kaufmann@mdc-berlin.de**.

# License
MEDICC2 is available under [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html). It contains modified code of the *pywrapfst* Python module from [OpenFST](http://www.openfst.org/) as permitted by the [Apache 2](http://www.apache.org/licenses/LICENSE-2.0) license.

# Please cite
Kaufmann TL, Petkovic M, Watkins TBK, Colliver EC, Laskina S, Thapa N, Minussi DC, Navin N, Swanton C, Van Loo P, Haase K, Tarabichi M, Schwarz RF.
**MEDICC2: whole-genome doubling aware copy-number phylogenies for cancer evolution**  
bioRxiv 2021 Sep 6; doi: 10.1101/2021.02.28.433227 

Schwarz RF, Trinh A, Sipos B, Brenton JD, Goldman N, Markowetz F.  
**Phylogenetic quantification of intra-tumour heterogeneity.**  
PLoS Comput Biol. 2014 Apr 17;10(4):e1003535. doi: 10.1371/journal.pcbi.1003535.


