<p align="center">
<img alt="ProHarMeD Logo" src="https://github.com/symbod/proharmed/blob/main/Logo.png?raw=true" width="500" />
</p>

# ProHarMeD - Proteomic Meta-Study Harmonization, Mechanotyping and Drug Repurposing Prediction

## Introduction
<span style="color:red">**TODO**</span>

This repository comprises four main harmonization functionalities:
- filter protein IDs
- remap gene names
- reduce gene names
- map orthologs

Additionally, it offers the following meta-analysis functionalities:
- intersection analysis
- disease mechanism mining and drug repurposing

A detailed tutorial with example data on how to use the proharmed Python package can be found [here](https://github.com/symbod/MaxQuantHandler/blob/main/tutorial.ipynb).

## Installation

```shell
pip install proharmed
```

## 1. Filter Protein IDs

MaxQuant requires FASTA files for protein assignment. Because MaxQuant can process several datasets in a single run, 
the results may contain protein IDs from several organisms.

This method checks the organism of each protein ID by querying the UniProt database directly and 
removes incorrectly assigned IDs. Additionally, decoy (REV__) and contaminant (CON__) IDs and/or unreviewed protein IDs can be removed.
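
To make the decoy and contaminant filtering more concrete, here is a minimal sketch in plain Python. It assumes that a cell holds several IDs separated by semicolons, as in MaxQuant output. The helper `drop_decoys_and_contaminants` and the example rows are hypothetical and are not part of the package, which additionally queries UniProt for the organism check.

```python
def drop_decoys_and_contaminants(cell, sep=";"):
    """Return the IDs in `cell` that carry neither a REV__ nor a CON__ prefix."""
    ids = [i.strip() for i in cell.split(sep) if i.strip()]
    kept = [i for i in ids if not i.startswith(("REV__", "CON__"))]
    return sep.join(kept)


def count_ids(cell, sep=";"):
    """Count the non-empty IDs in a cell."""
    return len([i for i in cell.split(sep) if i.strip()])


# Illustrative rows of a "Protein IDs" column (values are made up)
rows = ["P12345;REV__Q9XYZ1", "CON__P00761", "Q64537;P97571"]

filtered = [drop_decoys_and_contaminants(r) for r in rows]
removed_per_row = [count_ids(r) - count_ids(f) for r, f in zip(rows, filtered)]

print(filtered)
print(removed_per_row)
```

A row that only contained decoy or contaminant IDs ends up empty, which is the situation the `keep_empty` preference below is about.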

You may want to know how many IDs were filtered out, in total and per row. The call therefore also returns two data frames that present this information as tables.

The same information can also be displayed as plots with a single call.


#### 1.1 Imports
```python
import pandas as pd
from proharmed import filter_ids as fi
from proharmed.mq_utils.runner_utils import find_delimiter
```
#### 1.2 Load Your Data
```python
# load data into a dataframe with automated delimiter finder
# (file is the path to your input file)
data = pd.read_table(file, sep=find_delimiter(file)).fillna("")
```
#### 1.3 Set Preferences
```python
# mandatory
protein_column = "Protein IDs" # Name of column with protein IDs

# optional
organism = "rat" # Specify organism the IDs should match to
rev_con = False # Bool to indicate if protein IDs from decoy (REV__) and contaminants (CON__) should be kept
reviewed = False # Bool to indicate if newly retrieved protein IDs should be reduced to reviewed ones
keep_empty = False # Bool to indicate if empty ID cells should be kept or deleted
res_column = None # Name of column for filtered protein IDs results. If None, the protein_column will be overridden
```
#### 1.4 Run filter_protein_ids
```python
fi_data, fi_log_dict = fi.filter_protein_ids(data = data, protein_column = protein_column, 
                                             organism = organism, rev_con = rev_con, keep_empty = keep_empty, 
                                             reviewed = reviewed, res_column = res_column)
```

#### 1.5 Inspect Logging
```python
from proharmed.mq_utils import plotting as pt
# out_dir must be defined beforehand (output directory for the plots)
pt.create_overview_plot(fi_log_dict["Overview_Log"], out_dir = out_dir)
pt.create_filter_detailed_plot(fi_log_dict["Detailed_Log"], organism = organism, 
                               reviewed = reviewed, decoy = rev_con, out_dir = out_dir)
```


## 2. Remap Gene Names

Besides protein IDs, gene names are also taken from the respective FASTA files and mapped. They are needed for easier naming in plots and in analytical procedures such as enrichment analysis. Unfortunately, FASTA files are not always complete in terms of gene names.

This method retrieves the gene names assigned to the protein IDs by querying the UniProt database directly, and uses them to fill empty entries in the user file or even to replace existing entries. Several modes control which names are used.

Here, too, it is possible to subsequently obtain information on how many gene names were found for how many rows.

#### 2.1 Imports
```python
import pandas as pd
from proharmed import remap_genenames as rmg
from proharmed.mq_utils.runner_utils import find_delimiter
```
#### 2.2 Load Your Data
```python
# load data into a dataframe with automated delimiter finder
# (file is the path to your input file)
data = pd.read_table(file, sep=find_delimiter(file)).fillna("")
```
#### 2.3 Set Preferences
```python
# mandatory
mode = "uniprot_primary" # Mode of refilling. See below for details.
protein_column = "Protein IDs" # Name of column with protein IDs

# optional
gene_column = "Gene names" # Name of column with gene names
skip_filled = False # Bool to indicate if already filled gene names should be skipped
organism = "rat" # Specify organism the IDs should match to
fasta = None # Path of FASTA file, needed when mode is "all" or "fasta"
keep_empty = False # Bool to indicate if empty gene names cells should be kept or deleted
res_column = None # Name of column for remap gene names results. If None, the gene_column will be overridden
```
Modes of refilling:
- all: use primarily FASTA information and additionally UniProt information
- fasta: use information extracted from FASTA headers
- uniprot: use mapping information from UniProt and use all gene names
- uniprot_primary: use mapping information from UniProt and use only the primary gene names
- uniprot_one: use mapping information from UniProt and use only the most frequent single gene name
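
The difference between the three UniProt modes can be sketched with a small stand-alone example. Everything here is an assumption made for illustration: the lookup table replaces the real UniProt query, the protein IDs are invented, and `gene_names_for_row` is a hypothetical helper rather than the package's implementation. In particular, the sketch assumes that "all gene names" means primary names plus synonyms.

```python
from collections import Counter

# Hypothetical lookup table standing in for a UniProt query:
# protein ID -> (primary gene name, list of synonyms). Entries are illustrative.
LOOKUP = {
    "ID_A": ("Actb", ["Actx"]),
    "ID_B": ("Actb", []),
    "ID_C": ("Actg1", ["Actg"]),
}


def gene_names_for_row(protein_ids, mode):
    """Collect gene names for one row according to the chosen mode."""
    hits = [LOOKUP[p] for p in protein_ids if p in LOOKUP]
    primary = [name for name, _ in hits]
    if mode == "uniprot":
        # Assumption: "all gene names" means primary names plus synonyms.
        names = [n for name, syns in hits for n in [name, *syns]]
    elif mode == "uniprot_primary":
        names = primary
    elif mode == "uniprot_one":
        # Keep only the single most frequent primary name.
        names = [Counter(primary).most_common(1)[0][0]] if primary else []
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    # Remove duplicates while preserving the order of first appearance.
    return list(dict.fromkeys(names))


row = ["ID_A", "ID_B", "ID_C"]
all_names = gene_names_for_row(row, "uniprot")
primary_names = gene_names_for_row(row, "uniprot_primary")
single_name = gene_names_for_row(row, "uniprot_one")
print(all_names, primary_names, single_name)
```

The more restrictive the mode, the fewer names end up in a cell, which makes later steps such as reduction and ortholog mapping less ambiguous.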

#### 2.4 Run remap_genenames
```python
rmg_data, rmg_log_dict = rmg.remap_genenames(data = data, mode = mode, protein_column = protein_column,
                                             gene_column = gene_column, skip_filled = skip_filled, organism = organism, 
                                             fasta = fasta, keep_empty = keep_empty, res_column = res_column)
```

## 3. Reduce Gene Names
A well-known problem with gene symbols is that they are not unique and slight changes in spelling can lead to problems. Often there are different gene symbols for the same gene in UniProt. Depending on which protein IDs you used to get the gene symbol, you can get multiple gene symbols for the same gene by using the previous remap function.

This method makes it possible to reduce the gene symbols to a common gene symbol using different features and databases, thus preventing redundancy. There are multiple possible modes for which names should be taken.

Here, too, it is possible to subsequently obtain information on how many gene names were reduced for how many rows. This can also be displayed as a plot with a simple call.
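
As a rough illustration of what reducing gene names means, the sketch below collapses aliases to one common symbol per gene and counts how many rows changed. The synonym table is a hypothetical stand-in for a database such as Ensembl or HGNC, and `reduce_cell` is an invented helper, not the package's code.

```python
# Hypothetical synonym table: alias -> common symbol.
# A real run would obtain this mapping from a database.
SYNONYMS = {"Actx": "Actb", "Actb": "Actb", "Actg": "Actg1", "Actg1": "Actg1"}


def reduce_cell(cell, sep=";"):
    """Map every known name in a cell to its common symbol and drop duplicates."""
    names = [n for n in cell.split(sep) if n]
    reduced = [SYNONYMS[n] for n in names if n in SYNONYMS]
    return sep.join(dict.fromkeys(reduced))


# Illustrative rows of a "Gene names" column
cells = ["Actb;Actx", "Actg", "UnknownGene"]
reduced_cells = [reduce_cell(c) for c in cells]

# Simple log: number of rows whose content changed through the reduction
changed_rows = sum(1 for before, after in zip(cells, reduced_cells) if before != after)

print(reduced_cells)
print(changed_rows)
```

Names without an entry in the table are dropped in this sketch, so a row can become empty; whether such rows are kept is what `keep_empty` controls in the real call.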

#### 3.1 Imports
```python
import pandas as pd
from proharmed import reduce_genenames as rdg
from proharmed.mq_utils.runner_utils import find_delimiter
```

#### 3.2 Load Your Data
```python
# load data into a dataframe with automated delimiter finder
# (file is the path to your input file)
data = pd.read_table(file, sep=find_delimiter(file)).fillna("")
```

#### 3.3 Set Preferences
```python
# mandatory
mode = "ensembl" # Mode of reduction. See below for details.
gene_column = "Gene names" # Name of column with gene names
organism = "rat" # Specify organism the IDs should match to

# optional
res_column = None # Name of column of reduced gene names results. If None, the gene_column will be overridden
keep_empty = False # Bool to indicate if empty reduced gene names cells should be kept or deleted
HGNC_mode = None # Mode on how to reduce the gene names using HGNC (mostfrequent, all)
```

Modes of reduction:
- ensembl: use gProfiler to reduce gene names to those having an Ensembl ID
- HGNC: use HGNC database to reduce gene names to those having an entry in HGNC (only for human)
- mygeneinfo: use mygeneinfo database to reduce gene names to those having an entry in mygeneinfo
- enrichment: use gProfiler to reduce gene names to those having a functional annotation
 

#### 3.4 Run reduce_genenames
```python
rdg_data, rdg_log_dict = rdg.reduce_genenames(data = rmg_data, mode = mode, gene_column = gene_column, 
                                              organism = organism, res_column = res_column, keep_empty = keep_empty,
                                              HGNC_mode = HGNC_mode)
```

#### 3.5 Inspect Logging
```python
from proharmed.mq_utils import plotting as pt
# out_dir must be defined beforehand (output directory for the plots)
pt.create_overview_plot(rdg_log_dict["Overview_Log"], out_dir = out_dir)
pt.create_reduced_detailed_plot(rdg_log_dict["Detailed_Log"], out_dir = out_dir)
```



## 4. Get Orthologs

Comparing data between organisms, for example in a review across several species, runs into a known problem: gene names differ between species. All IDs therefore have to be mapped to one selected organism through ortholog mapping.

Using the commonly used gProfiler, this method simply maps the gene names from the current organism to the target organism.

Unfortunately, depending on the source and target organism, no orthologous gene can be found for some gene names. This method also reports how often that happens.

As with the previous tasks, the log information can be displayed in plots.
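
The idea of ortholog mapping and its log can be sketched without any external service. The rat-to-human table below is a tiny hypothetical stand-in for the gProfiler lookup, the unmapped names are invented, and `map_cell` is not a function of the package.

```python
# Hypothetical rat -> human ortholog table standing in for a gProfiler query
ORTHOLOGS = {"Actb": "ACTB", "Tp53": "TP53"}


def map_cell(cell, sep=";"):
    """Replace each gene name in a cell by its ortholog; names without one are dropped."""
    names = [n for n in cell.split(sep) if n]
    mapped = [ORTHOLOGS[n] for n in names if n in ORTHOLOGS]
    return sep.join(dict.fromkeys(mapped))


# Illustrative rows of a "Gene names" column
cells = ["Actb", "Tp53;NoOrthologA", "NoOrthologB"]
mapped_cells = [map_cell(c) for c in cells]

# Simple log: rows left without any ortholog, and the names that could not be mapped
rows_without_ortholog = sum(1 for m in mapped_cells if not m)
unmapped_names = [n for c in cells for n in c.split(";") if n and n not in ORTHOLOGS]

print(mapped_cells)
print(rows_without_ortholog, unmapped_names)
```

Counting unmapped names separately from empty rows shows whether the losses affect whole rows or only individual entries within a row.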


#### 4.1 Imports
```python
import pandas as pd
from proharmed import map_orthologs as mo
from proharmed.mq_utils.runner_utils import find_delimiter
```

#### 4.2 Load Your Data
```python
# load data into a dataframe with automated delimiter finder
# (file is the path to your input file)
data = pd.read_table(file, sep=find_delimiter(file)).fillna("")
```

#### 4.3 Set Preferences
```python
# mandatory
gene_column = "Gene names" # Name of column with gene names
source_organism = "rat" # Specify organism the IDs match to
tar_organism = "human" # Specify organism the IDs should be mapped to

# optional
keep_empty = False # Bool to indicate if empty ortholog gene names cells should be kept or deleted
res_column = None # Name of column of orthologs gene names results. If None, the gene_column will be overridden
```

#### 4.4 Run map_orthologs
```python
mo_data, mo_log_dict = mo.map_orthologs(data = data, gene_column = gene_column, organism = source_organism,
                                        tar_organism = tar_organism, keep_empty = keep_empty, 
                                        res_column = res_column)
```

#### 4.5 Inspect Logging
```python
from proharmed.mq_utils import plotting as pt
# out_dir must be defined beforehand (output directory for the plots)
pt.create_overview_plot(mo_log_dict["Overview_Log"], out_dir = out_dir)
pt.create_ortholog_detailed_plot(mo_log_dict["Detailed_Log"], organism = source_organism, out_dir = out_dir)
```
