Metadata-Version: 2.1
Name: chatdesk-grouphug
Version: 0.8.1
Summary: GroupHug is a library with extensions to 🤗 transformers for multitask language modelling.
Home-page: https://github.com/sanderland/grouphug
License: Apache2
Keywords: transformers,language modelling,machine learning,classification
Author: Sander Land
Requires-Python: >=3.8,<4.0
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: Unidecode (>=1.3.4,<2.0.0)
Requires-Dist: datasets (>=2.0.0,<3.0.0)
Requires-Dist: demoji (>=1.1.0,<2.0.0)
Requires-Dist: evaluate (>=0.3.0,<0.4.0)
Requires-Dist: numpy (>=1.21,<2.0)
Requires-Dist: regex (>=2022.3.15,<2023.0.0)
Requires-Dist: sentencepiece (>=0.1.96,<0.2.0)
Requires-Dist: torch (>=1.10.0,<2.0.0)
Requires-Dist: transformers (>=4.20.0,<5.0.0)
Description-Content-Type: text/markdown


# grouphug

GroupHug is a library with extensions to 🤗 transformers for multitask language modelling.
In addition, it contains utilities that ease data preparation, training, and inference.

## Project Moved

Grouphug maintenance and future versions have moved to [my personal repository](https://github.com/sanderland/grouphug).

## Overview

The package is optimized for training a single language model to make quick and robust predictions for a wide variety of related tasks at once,
 as well as to investigate the regularizing effect of training a language modelling task at the same time.

You can train on multiple datasets, with each dataset containing an arbitrary subset of your tasks. Supported tasks include: 

* A single language modelling task (Masked language modelling, Masked token detection, Causal language modelling).
  * The default collator included handles most preprocessing for these heads automatically.
* Any number of classification tasks, including single- and multi-label classification and regression
  * A utility function that automatically creates a classification head from your data. 
  * Additional options such as hidden layer size, additional input variables, and class weights.
* You can also define your own model heads.

## Quick Start

The project is based on Python 3.8+ and PyTorch 1.10+. To install it, simply use:

`pip install grouphug`

### Documentation

Documentation can be generated from docstrings using `make html` in the `docs` directory, but this is not yet on a hosted site. 

### Example usage

```python
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer

from grouphug import AutoMultiTaskModel, ClassificationHeadConfig, DatasetFormatter, LMHeadConfig, MultiTaskTrainer

# load some data. 'label' gets renamed in huggingface, so is better avoided as a feature name.
task_one = load_dataset("tweet_eval",'emoji').rename_column("label", "tweet_label")
both_tasks = pd.DataFrame({"text": ["yay :)", "booo!"], "sentiment": ["pos", "neg"], "tweet_label": [0,14]})

# create a tokenizer
base_model = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# preprocess your data: tokenization, preparing class variables
formatter = DatasetFormatter().tokenize().encode("sentiment")
# data converted to a DatasetCollection: essentially a dict of DatasetDict
data = formatter.apply({"one": task_one, "both": both_tasks}, tokenizer=tokenizer, test_size=0.05)

# define which model heads you would like
head_configs = [
    LMHeadConfig(weight=0.1),  # default is BERT-style masked language modelling
    ClassificationHeadConfig.from_data(data, "sentiment"),  # detects dimensions and type
    ClassificationHeadConfig.from_data(data, "tweet_label"),  # detects dimensions and type
]
# create the model, optionally saving the tokenizer and formatter along with it
model = AutoMultiTaskModel.from_pretrained(base_model, head_configs, formatter=formatter, tokenizer=tokenizer)
# create the trainer
trainer = MultiTaskTrainer(
    model=model,
    tokenizer=tokenizer,
    train_data=data[:, "train"],
    eval_data=data[["one"], "test"],
    eval_heads={"one": ["tweet_label"]},  # limit evaluation to one classification task
)
trainer.train()
```

### Tutorials

See [examples](./examples) for a few notebooks that demonstrate the key features.

## Supported Models

The package has support for the following base models:

* Bert, DistilBert, Roberta/DistilRoberta, XLM-Roberta 
* Deberta/DebertaV2
* Electra
* GPT2, GPT-J, GPT-NeoX, OPT

Extending it to support other models is possible by simply inheriting from `_BaseMultiTaskModel`, although language modelling head weights may not always load. 

## Limitations

* The package only supports PyTorch, and will not work with other frameworks. There are no plans to change this.
* Grouphug was developed and tested with 🤗 transformers 4.19-4.22. We will aim to test and keep compatibility with the latest version, but it is still recommended to lock the latest working versions. 

See the [contributing page](CONTRIBUTING.md) if you are interested in contributing.

## License

grouphug was developed by [Chatdesk](http://www.chatdesk.com) and is licensed under the Apache 2 [license](LICENSE).







