Metadata-Version: 2.1
Name: apyxl
Version: 0.1.2
Summary: A Python package for data analysis and model optimization.
Home-page: https://github.com/CyrilJl/apyxl
Author: Cyril Joly
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: xgboost>=2.0.0
Requires-Dist: scikit-learn
Requires-Dist: shap
Requires-Dist: hyperopt
Requires-Dist: matplotlib

# <img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/logo.svg" alt="apyxl logo" width="40" height="40"> apyxl

The `apyxl` package (Another PYthon package for eXplainable Learning) is a simple wrapper around [`xgboost`](https://xgboost.readthedocs.io/en/stable/python/index.html), [`hyperopt`](https://hyperopt.github.io/hyperopt/), and [`shap`](https://shap.readthedocs.io/en/latest/). It lets the user build a performant regression or classification model and leverage SHAP analysis to better understand the links the model builds between its inputs and outputs. With `apyxl`, processing categorical features, fitting the model with Bayesian hyperparameter search, and instantiating the associated SHAP explainer can all be done in a single line of code, streamlining the entire workflow from data preparation to model explanation.

## Current Features

- Automatic One-Hot-Encoding for categorical variables
- Bayesian hyperparameter optimization using `hyperopt`
- Simple explainability visualizations using `shap` (`beeswarm`, `decision`, `force`, `scatter`)
- Focus on classification and regression tasks
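
Regarding the first feature: the details of apyxl's internal encoding are not shown here, but the underlying idea is plain one-hot encoding, which can be sketched with `pandas.get_dummies` (illustrative data, not apyxl's internal code):

```python
import pandas as pd

# A small frame mixing numeric and categorical columns (illustrative data)
df = pd.DataFrame({
    'surface': [50.0, 75.0, 30.0],
    'city': ['Paris', 'Lyon', 'Paris'],
})

# One-hot encoding expands each categorical column into one 0/1 column per level
encoded = pd.get_dummies(df, columns=['city'], dtype=int)
print(encoded.columns.tolist())
# ['surface', 'city_Lyon', 'city_Paris']
```

This is what allows an XGBoost model to consume categorical columns without manual preprocessing.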

## Planned Enhancements

- Time-series data handling and normalization
- A/B test analysis capabilities

## Installation

To install the package, use:

```bash
pip install apyxl
```

## Basic Usage

### 1. Regression

```python
from apyxl import XGBRegressorWrapper
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(as_frame=True, return_X_y=True)
X.shape, y.shape
>>> ((20640, 8), (20640,))

model = XGBRegressorWrapper().fit(X, y)
# defaults to r2 score
model.best_score
>>> 0.6671771984999055

# Plot methods can internally handle the computation of SHAP values
model.beeswarm(X=X.sample(2_500))
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/a.png" width="500">

```python
model.scatter(X=X.sample(2_500), feature='Latitude')
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/b.png" width="500">

### 2. Classification

```python
from apyxl import XGBClassifierWrapper
from sklearn.datasets import fetch_covtype

X, y = fetch_covtype(as_frame=True, return_X_y=True)
# Remap the labels from 1-7 to 0-6, as expected by XGBoost
y -= 1
y.unique()
>>> array([4, 1, 0, 6, 2, 5, 3])

X.shape, y.shape
>>> ((581012, 54), (581012,))

# To speed up the process, Bayesian hyperparameter optimization can be performed on a subset of the 
# dataset. The model is then fitted on the entire dataset using the optimized hyperparameters.
model = XGBClassifierWrapper().fit(X, y, n=25_000)
# defaults to Matthews correlation coefficient
model.best_score
>>> 0.5892932365687379

# Computing SHAP values can be resource-intensive, so it's advisable to compute them once and
# reuse them, especially in multiclass classification, where the cost is even higher than in
# binary classification (the SHAP values array has shape (n_samples, n_features, n_classes))
shap_values = model.compute_shap_values(X.sample(1_000))
shap_values.shape
>>> (1000, 54, 7)
# The `output` argument selects the SHAP values associated with the desired class
model.beeswarm(shap_values=shap_values, output=2, max_display=15)
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/c.png" width="500">

```python
model.scatter(shap_values=shap_values, feature='Elevation', output=4)
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/d.png" width="500">
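
For reference, the Matthews correlation coefficient used as the default classification score above can be computed directly with scikit-learn. A minimal sketch on toy labels (illustrative, not produced by the model above):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Toy multiclass labels (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])

# MCC ranges from -1 to +1, with 0 corresponding to random guessing;
# unlike accuracy, it remains informative on imbalanced datasets
score = matthews_corrcoef(y_true, y_pred)
print(score)  # ≈ 0.63
```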


### 3. Time Series Normalization and A/B Tests
#### 3.1. Time Series Normalization
Weather normalization of time series is a trend-discovery technique long used in weather-dependent applications (such as energy consumption or [air pollution](https://github.com/skgrange/normalweatherr)). My research suggests it is equivalent to a SHAP analysis that treats time as a simple numeric variable. Tree-based methods such as gradient boosting are particularly well suited to discovering breakpoint changes, as they recursively split the dataset along one variable and one threshold.
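
To illustrate that breakpoint-finding ability in isolation, here is a minimal sketch using scikit-learn's `DecisionTreeRegressor` instead of XGBoost, on synthetic data similar to the example that follows (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy signal with a +2 level shift on a known sub-period (synthetic data)
n = 8760
y = rng.normal(loc=10, scale=2, size=n)
y[6000:7000] += 2

# A small tree regressed on the time index recovers the breakpoints:
# each internal node is a (variable, threshold) split
t = np.arange(n).reshape(-1, 1)
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(t, y)
thresholds = sorted(th for th in tree.tree_.threshold if th >= 0)
print(thresholds)  # two splits, close to 6000 and 7000
```

Gradient boosting stacks many such trees, which is why it handles both smooth relationships and abrupt, time-localized shifts.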

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyxl import XGBRegressorWrapper

n = 8760
time = pd.date_range(start='2024-01-01', freq='h', periods=n)

# Generate two correlated time series, `a` and `b`
cov = [[1, 0.7], [0.7, 1]]
mean = [0, 5]

df = np.random.multivariate_normal(cov=cov, mean=mean, size=n)
df[:, 1] *= 2

# Shift time series `b` over a continuous subset of the period
df[6000:7000, 1] += 2

df = pd.DataFrame(df, columns=['a', 'b'], index=time)

df.plot(lw=0.7)
plt.show()
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/e.png" width="500">

```python
# Process the time index as a simple numeric variable, i.e. the number of
# days since the beginning of the dataset (any other time unit would work)
df['time_numeric'] = ((df.index - df.index.min())/pd.Timedelta(days=1)).astype(int)

# `apyxl` can then be used as follows:
target = 'b'
X, y = df.drop(columns=target), df[target]
model = XGBRegressorWrapper(random_state=0).fit(X, y)
model.scatter(X, feature='a')
model.scatter(X, feature='time_numeric')
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/f.png" width="500">

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/g.png" width="500">

The fitted XGBoost regressor manages to capture the linear relationship between `a` and `b` (except for extreme values), as well as the temporary, time-localized shift between the two series. This trend, i.e. the part of `b`'s behavior that cannot be explained by `a`, can be isolated:

```python
shap_values = model.compute_shap_values(X)
pd.Series(shap_values[:, 'time_numeric'].values, index=X.index).plot(title='time series `b` normalized by `a`')
plt.show()
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/h.png" width="500">

#### 3.2. A/B tests
Let's now look at our dataset in a different way:
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyxl import XGBRegressorWrapper

n = 8760
time = pd.date_range(start='2024-01-01', freq='h', periods=n)

# Generate two correlated time series, `a` and `b`
cov = [[1, 0.7], [0.7, 1]]
mean = [0, 5]

df = np.random.multivariate_normal(cov=cov, mean=mean, size=n)
df[:, 1] *= 2

# Shift time series `b` over a continuous subset of the period
df[6000:7000, 1] += 2

df = pd.DataFrame(df, columns=['a', 'b'], index=time).rename_axis(index='time', columns='id')
df = df.stack().rename('value').reset_index().set_index('time')
df['time_numeric'] = ((df.index-df.index.min())/pd.Timedelta(days=1)).astype(int)
df.sample(5)

>>>                     id     value  time_numeric
>>> time                                          
>>> 2024-12-24 05:00:00  a  1.944142           358
>>> 2024-09-01 11:00:00  a -0.528874           244
>>> 2024-10-26 22:00:00  b  7.377142           299
>>> 2024-04-17 03:00:00  a  0.744991           107
>>> 2024-12-15 11:00:00  b  8.370796           349
```

We are now dealing with less structured data: a value of interest and two different ids. Does the behavior of `value` change over time differently depending on the id?

```python
target = 'value'
X, y = df.drop(columns=target), df[target]
model = XGBRegressorWrapper(max_evals=25).fit(X, y)
model.beeswarm(X)
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/i.png" width="500">

```python
model.scatter(X, feature='time_numeric')
```

<img src="https://raw.githubusercontent.com/CyrilJl/apyxl/main/_static/j.png" width="500">

The SHAP analysis is clearly able to isolate relative changes of correlated time series over time.

The approach showcased in this package, which uses tree-based models like XGBoost for time series normalization and A/B testing, shares conceptual similarities with certain econometric techniques. For instance, methods such as difference-in-differences (DiD) and fixed effects models are traditionally employed to isolate the impact of a treatment or an event over time while controlling for confounding factors. These econometric techniques also aim to discern underlying trends by accounting for both time-varying and time-invariant factors. The package's application of SHAP values for interpreting model outputs offers a novel way to quantify the impact of variables, much like econometric models quantify the effects of covariates. A future comparison between this machine-learning-based approach and traditional econometric methods could reveal interesting insights, particularly in the context of non-linear relationships and the ability to capture complex interactions in time series data.
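
To make the analogy concrete, a textbook difference-in-differences estimate can be computed by hand on the synthetic data used above (series `b` is "treated" over a known window, `a` is the control). This is a standalone illustration, not part of apyxl:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 8760

# Same synthetic setup as above: correlated series `a` and `b`,
# with `b` shifted by +2 over a known "treatment" window
data = rng.multivariate_normal(mean=[0, 5], cov=[[1, 0.7], [0.7, 1]], size=n)
data[:, 1] *= 2
data[6000:7000, 1] += 2
df = pd.DataFrame(data, columns=['a', 'b'])

treated = np.zeros(n, dtype=bool)
treated[6000:7000] = True

# Difference-in-differences: change in the treated series minus
# change in the control series between the two periods
did = ((df.loc[treated, 'b'].mean() - df.loc[~treated, 'b'].mean())
       - (df.loc[treated, 'a'].mean() - df.loc[~treated, 'a'].mean()))
print(round(did, 2))  # recovers an effect close to the true +2
```

The SHAP-based approach reaches a similar conclusion without requiring the treatment window to be specified in advance.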

## Note

Please note that this package is still under development, and features may change or expand in future versions.
