Metadata-Version: 2.1
Name: causalnlp
Version: 0.1.0
Summary: CausalNLP
Home-page: https://github.com/amaiya/causalnlp/tree/main/
Author: Arun S. Maiya
Author-email: arun@maiya.net
License: Apache Software License 2.0
Keywords: causality nlp causal-inference natural-language-processing
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

# CausalNLP
> CausalNLP is a practical toolkit for causal inference with text


## Install

1. `pip install -U pip`
2. `pip install causalnlp`

## Usage

### Example: What is the causal impact of a positive review on a product click?

```
import pandas as pd
# pandas >= 1.3 uses on_bad_lines='skip'; older versions use error_bad_lines=False instead
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
```

The file `music_seed50.tsv` is a semi-simulated dataset from the [causal-text repository](https://github.com/rpryzant/causal-text). Relevant columns include:
- `Y_sim`: simulated outcome, where 1 means the product was clicked and 0 means it was not.
- `C_true`: confounding categorical variable (1=audio CD, 0=other)
- `T_true`: 1 means a rating of 5, 0 means a rating less than 3; `T_true` affects the outcome `Y_sim`.
- `T_ac`: An approximation of true review sentiment (`T_true`) created with [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html).

We'll pretend the true sentiment (i.e., review rating and `T_true`) is hidden and only use `T_ac` as the treatment variable. 
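
A quick sanity check for a proxy treatment such as `T_ac` is to measure how often it agrees with the true label when the latter is available. The sketch below is illustrative only: it uses a tiny in-memory stand-in for the TSV file with made-up values, and relies only on the column names `T_true` and `T_ac` described above.

```python
import csv
import io

# Toy stand-in for the real TSV; these rows are made up for illustration.
toy_tsv = "T_true\tT_ac\n1\t1\n1\t0\n0\t0\n0\t0\n1\t1\n0\t1\n1\t1\n1\t1\n"

rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))

# Fraction of rows where the proxy label matches the true label.
matches = sum(row["T_true"] == row["T_ac"] for row in rows)
agreement = matches / len(rows)
print(f"T_ac agrees with T_true on {agreement:.0%} of rows")
```

The lower the agreement, the more any downstream effect estimate reflects noise in the proxy rather than the underlying sentiment.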

Using the `text_col` parameter, we include the raw review text as another "controlled-for" variable.

```
from causalnlp.causalinference import CausalInferenceModel
from lightgbm import LGBMClassifier
cm = CausalInferenceModel(df, 
                         metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),
                         treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                         include_cols=['C_true'])
cm.fit()
```

    outcome column (categorical): Y_sim
    treatment column: T_ac
    numerical/categorical covariates: ['C_true']
    text covariate: text
    preprocess time:  1.1216762065887451  sec
    start fitting causal inference model
    time to fit causal inference model:  9.701336860656738  sec
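
For intuition about `metalearner_type='t-learner'`: a T-learner fits one outcome model on the treated rows and another on the control rows, then takes the difference of their predictions as each row's estimated effect. The following is a minimal pure-Python sketch of that idea, not CausalNLP's implementation. The per-stratum mean "learner", the helper names, and the toy data are all assumptions made for illustration; the example above uses `LGBMClassifier` over the review text and `C_true` instead.

```python
from collections import defaultdict
from statistics import mean

def fit_mean_learner(rows):
    """Toy stand-in for a real learner: predict the mean outcome per covariate value."""
    outcomes_by_c = defaultdict(list)
    for c, y in rows:
        outcomes_by_c[c].append(y)
    table = {c: mean(ys) for c, ys in outcomes_by_c.items()}
    return lambda c: table[c]

def t_learner_effects(data):
    """data holds (covariate, treatment, outcome) tuples; returns one effect per row."""
    predict_treated = fit_mean_learner([(c, y) for c, t, y in data if t == 1])
    predict_control = fit_mean_learner([(c, y) for c, t, y in data if t == 0])
    return [predict_treated(c) - predict_control(c) for c, _, _ in data]

# Made-up toy rows (C, T, Y): the C=1 stratum is treated more often
# and also has a higher baseline click rate, so C confounds T and Y.
toy = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] + [(1, 0, 1), (1, 0, 0)]
    + [(0, 1, 1), (0, 1, 0)] + [(0, 0, 0)] * 4
)

effects = t_learner_effects(toy)
adjusted_ate = mean(effects)

# Naive comparison that ignores the confounder.
naive = mean(y for _, t, y in toy if t == 1) - mean(y for _, t, y in toy if t == 0)

print(f"adjusted ATE: {adjusted_ate:.3f}, naive difference: {naive:.3f}")
```

On this toy data the naive difference in means (0.500) overstates the adjusted estimate (0.375), because the C=1 stratum is both treated more often and more likely to click regardless of treatment.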


#### Results

The estimated average treatment effect indicates that a positive review increases the probability of a click by about 13 percentage points in this dataset.

The average treatment effect (ATE):

```
print( cm.estimate_ate() )
```

    {'ate': 0.1309311542209525}


The conditional average treatment effect (CATE) for those reviews that mention the word "toddler":

```
print( cm.estimate_ate(df['text'].str.contains('toddler')) )
```

    {'ate': 0.15559234254638685}
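
Conceptually, once an effect has been estimated for every row, the ATE is the mean of those estimates over all rows, and a CATE is the same mean restricted to a subgroup. A minimal sketch, assuming hypothetical per-review effect estimates (the values and the `average_effect` helper are illustrative, not CausalNLP output or API):

```python
from statistics import mean

# Hypothetical (text, estimated effect) pairs; the numbers are made up.
reviews = [
    ("my toddler loves this album", 0.20),
    ("great cd for the car", 0.10),
    ("toddler approved songs", 0.30),
    ("not my favorite", 0.00),
]

def average_effect(rows, keep=lambda text: True):
    """Mean of the per-row effects over rows whose text satisfies `keep`."""
    selected = [effect for text, effect in rows if keep(text)]
    return mean(selected)

ate = average_effect(reviews)  # all rows
cate_toddler = average_effect(reviews, keep=lambda text: "toddler" in text)  # subgroup only

print(f"ATE: {ate:.2f}, CATE (toddler): {cate_toddler:.2f}")
```

A substring test like this one also matches longer words such as "toddlers", which is worth keeping in mind when defining subgroups by keyword.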


Features most predictive of the treatment effect (here, the increase in the probability of a product click):

```
print( cm.interpret(plot=False)[1][:10] )
```

    v_music    0.079042
    v_cd       0.066838
    v_album    0.055168
    v_like     0.040784
    v_love     0.040635
    C_true     0.039949
    v_just     0.035671
    v_song     0.035362
    v_great    0.029918
    v_heard    0.028373
    dtype: float64


Features with the `v_` prefix are word features. `C_true` is the categorical variable indicating whether or not the product is a CD. 

## Documentation
API documentation and additional usage examples are available at: https://amaiya.github.io/causalnlp/

## How to Cite

Please cite [the following paper](https://arxiv.org/abs/2106.08043) when using CausalNLP in your work:

```
@article{maiya2021causalnlp,
    title={CausalNLP: A Practical Toolkit for Causal Inference with Text},
    author={Arun S. Maiya},
    year={2021},
    eprint={2106.08043},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    journal={arXiv preprint arXiv:2106.08043},
}
```


