# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['tftokenizers']

package_data = \
{'': ['*']}

install_requires = \
['Sphinx==4.1.2',
 'datasets>=1.17.0,<2.0.0',
 'myst-parser==0.15.2',
 'pydantic>=1.9.0,<2.0.0',
 'python-decouple>=3.5,<4.0',
 'readthedocs-sphinx-search==0.1.1',
 'recommonmark>=0.7.1,<0.8.0',
 'requests==2.26.0',
 'rich[jupyter]>=10.14.0,<11.0.0',
 'sentencepiece>=0.1.96,<0.2.0',
 'sphinx-copybutton==0.4.0',
 'sphinx-markdown-tables==0.0.15',
 'sphinx-rtd-theme==1.0.0',
 'sphinxemoji>=0.2.0,<0.3.0',
 'sphinxext-opengraph==0.4.2',
 'tensorflow-datasets>=4.4.0,<5.0.0',
 'tensorflow-hub>=0.9.0,<0.10.0',
 'tensorflow-text==2.5.0',
 'tensorflow==2.5.2',
 'tf-sentencepiece>=0.1.92,<0.2.0',
 'tomlkit==0.7.2',
 'torch>=1.10.1,<2.0.0',
 'transformers>=4.15.0,<5.0.0',
 'unzip>=1.0.0,<2.0.0',
 'wget>=3.2,<4.0']

setup_kwargs = {
    'name': 'tftokenizers',
    'version': '0.1.5',
    'description': 'Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels.',
    'long_description': '# TFtftransformers\n\nConverting Hugginface tokenizers to Tensorflow tokenizers. The main reason is to be able to bundle the tokenizer and model into one Reusable SavedModel, inspired by the [Tensorflow Official Guide on tokenizers](hhttps://www.tensorflow.org/text/guide/bert_preprocessing_guide)\n\n## <a href="https://badge.fury.io/py/tftokenizers"><img src="https://badge.fury.io/py/tftokenizers.svg" alt="PyPI version" height="18"></a>\n\n**Source Code**: <a href="https://github.com/Huggingface-Supporters/tftftransformers" target="_blank">https://github.com/Hugging-Face-Supporter/tftokenizers</a>\n\n---\n\nModels we know works:\n\n```python\n"bert-base-cased"\n"bert-base-uncased"\n"bert-base-multilingual-cased"\n"bert-base-multilingual-uncased"\n# Distilled\n"distilbert-base-cased"\n"distilbert-base-multilingual-cased"\n"microsoft/MiniLM-L12-H384-uncased"\n# Non-english\n"KB/bert-base-swedish-cased"\n"bert-base-chinese"\n```\n\n## Examples\n\nThis is an example of how one can use Huggingface model and tokenizers bundled together as a [Reusable SavedModel](https://www.tensorflow.org/hub/reusable_saved_models) and yields the same result as using the model and tokenizer from Huggingface 🤗\n\n```python\nimport tensorflow as tf\nfrom transformers import TFAutoModel\nfrom tftokenizers import TFModel, TFAutoTokenizer\n\n# Load base models from Huggingface\nmodel_name = "bert-base-cased"\nmodel = TFAutoModel.from_pretrained(model_name)\n\n# Load converted TF tokenizer\ntokenizer = TFAutoTokenizer.from_pretrained(model_name)\n\n# Create a TF Reusable SavedModel\ncustom_model = TFModel(model=model, tokenizer=tokenizer)\n\n# Tokenizer and model can handle `tf.Tensors` or regular strings\ntf_string = tf.constant(["Hello from Tensorflow"])\ns1 = "SponGE bob SQuarePants is an avenger"\ns2 = "Huggingface to Tensorflow tokenizers"\ns3 = "Hello, world!"\n\noutput = custom_model(tf_string)\noutput = custom_model([s1, s2, s3])\n\n# We can now pass input as tensors\noutput = custom_model(\n    inputs=tf.constant([s1, s2, s3], dtype=tf.string, name="inputs"),\n)\n\n# Save tokenizer\nsaved_name = "reusable_bert_tf"\ntf.saved_model.save(custom_model, saved_name)\n\n# Load tokenizer\nreloaded_model = tf.saved_model.load(saved_name)\noutput = reloaded_model([s1, s2, s3])\nprint(output)\n```\n\n## `Setup`\n\n```bash\ngit clone https://github.com/Hugging-Face-Supporter/tftokenizers.git\ncd tftokenizers\npoetry install\npoetry shell\n```\n\n## `Run`\n\nTo convert a Huggingface tokenizer to Tensorflow, first choose one from the models or tokenizers from the Huggingface hub to download.\n\n**NOTE**\n\n> Currently only BERT models work with the converter.\n\n### `Download`\n\nFirst download tokenizers from the hub by name. Either run the bash script do download multiple tokenizers or download a single tokenizer with the python script.\n\nThe idea is to eventually only to automatically download and convert\n\n```bash\npython tftokenizers/download.py -n bert-base-uncased\nbash scripts/download_tokenizers.sh\n```\n\n### `Convert`\n\nConvert downloaded tokenizer from Huggingface format to Tensorflow\n\n```bash\npython tftokenizers/convert.py\n```\n\n## `Before Commit`\n\n```bash\nmake build\n```\n\n## FAQ\n\n### How to know what tokenizer is used?\n**TL;DR**\n```python\nfrom transformers import AutoTokenizer\n\nname = "bert-base-cased"\ntokenizer = AutoTokenizer.from_pretrained(name)\n\n# IF the tokenizer is fast:\nprint(tokenizer.is_fast)\n# Base tokenizer model\nprint(type(tokenizer.backend_tokenizer.model))\n# Check if it is a SentencePiece tokenizer\n# Should be `vocab.txt` or `vocab.json` if not SentencePiece tokenizer\n# SencePiece if "vocab_file":\n#   "sentencepiece.bpe.model"\nprint(tokenizer.vocab_files_names)\n\n# Else\n# Find if the model is a SentencePiece model with\nprint(vars(tokenizer).get("spm_file", None))\n# print(vars(tokenizer).get("sp_model", None))\n```\n\n<details>\n<summary>:memo: Read More:</summary>\nAnd the components of the tokenizers described [here](https://huggingface.co/docs/tokenizers/python/latest/components.html) as:\n- Normalizers\n- Pre tokenizers\n- [Models](https://huggingface.co/docs/tokenizers/python/latest/components.html#models)\n- PostProcessor\n- Decoders\n\n\nWhen loading a tokenizer with Huggingface transformers, it maps the name of the model from the Huggingface Hub to the correct model and tokenizer available there, if not it will try to to find a folder on your local computer with that name.\n\nAdditionally, tokenizers from Huggingface are defined in multiple different steps using the Huggingface tokenizer library. For those interested, you can look into the different components of that library of how the composition of a tokenizer works [here](https://huggingface.co/docs/tokenizers/python/latest/). There is also a great guide documenting how composition of tokenizers are done in this [Medium article](https://towardsdatascience.com/designing-tokenizers-for-low-resource-languages-7faa4ab30ef4)\n</details>\n\n### What tokenizers are used by what models?\n<details>\n<summary>:memo: Read More:</summary>\nAs stated in the section above, you will need to look at each model to inspect the type of tokenizer it is using, but in general there are just a few "base tokenizers / models". See [Huggingface documentation](https://huggingface.co/docs/transformers/tokenizer_summary) for explanation on how these "base tokenizers" are defined\n\n[Base Tokenizer Names](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/models/__init__.py)\n[Model Implementations](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations)\n\nSentencePiece tokenizers can either be BPE (rare if the tokenizers is fast) or Unigram (all Unigram == SentencePiece)\n#### BPE = tokenizers.models.BPE\n- Implemented by\n\n    [byte-pair BPE](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/implementations/byte_level_bpe.py), [char-level BPE](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/implementations/char_level_bpe.py), ([SentencePiece BPE](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py))\n\n- Used by\n\n    `GPT`, `XLNet`, `FlauBERT`, `RoBERTa`, `GPT-2`, `GPT-j`, `GPT-neo`, `BART`, `XLM-RoBERTa`\n#### Unigram = tokenizers.models.Unigram\n- Implemented by\n\n    [SentencePiece Unicode](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/implementations/sentencepiece_unigram.py)\n\n- Used by\n\n    All `T5` models\n#### WordPiece = tokenizers.models.WordPiece\n- Implemented by\n\n    [Bert WordPiece](https://github.com/huggingface/tokenizers/blob/master/bindings/python/py_src/tokenizers/implementations/bert_wordpiece.py)\n\n- Used by\n\n    `BERT`, `mBERT`, `miniLM`, distilled versions of BERT\n\n#### SentencePiece\nSentencePiece is a method for creating sub-word tokenizations.\nIt supports BPE and Unigram.\n\nSentencePiece is a separate C++ implemented library with python and Tensorflow bindings.\nThe vocabulary is bundled into:\n\n**For fast models**:\n\n"vocab_file_names":\n\n    `sentencepiece.bpe.model` for "BPE" and\n    `spiece.model` for Unigram\n\n**For slow models**:\n\n"vocab_file_names":\n\n    \'source_spm\': \'source.spm\',\n    \'target_spm\': \'target.spm\',\n    \'vocab\': \'vocab.json\'\n\n"spm_files":\n\n    will be a single file or a list of files\n    ...\n\n- Used by:\n\n    **Fast**: `T5` models\n    **Slow**: `facebook/m2m100_418M`, `facebook/wmt19-en-de`\n</details>\n\n### How to implement the tokenizers from Huggingface to Tensorflow?\nYou will need to download the Huggingface tokenizer of your choice, determine the type of the tokenizer (`is_fast`, tokenizer type and `vocab_file_names`). Then map the tokenizer used to the Tensorflow supported equivalent:\n\nhttps://github.com/tensorflow/text/issues/422\n\n**BPE** and **Unigram**:\n- All BPE implementations for Tensorflow is backed by SentencePiece\n- [SentencePiece in TensorFlow](https://www.tensorflow.org/text/api_docs/python/text/SentencepieceTokenizer)\n- [Official Answer 1](https://github.com/tensorflow/text/issues/415)\n- [Official Answer 2](https://github.com/tensorflow/text/issues/763)\n- [How to load a SentencePiece model](https://github.com/tensorflow/text/issues/215)\n- [Input will need to be Tensors](https://github.com/tensorflow/text/issues/512)\n- [How to load model from vocab](https://github.com/tensorflow/text/issues/452)\n\n**WordPiece**:\n- [BertTokenizer](https://www.tensorflow.org/text/api_docs/python/text/BertTokenizer) or\n- [WordPiece](https://www.tensorflow.org/text/api_docs/python/text/FastWordpieceTokenizer) or\n- [FastWordPiece](https://www.tensorflow.org/text/api_docs/python/text/FastWordpieceTokenizer)\n\n\nhttps://github.com/tensorflow/text/issues/116\nhttps://github.com/tensorflow/text/issues/414\n\n### What other ways are there to convert a tokenizer?\n<details>\n<summary>:memo: Read More:</summary>\nWith `tfokenizers` there are three ways to use the package:\n\n```python\nimport tensorflow as tf\nimport tensorflow_text as text\nfrom transformers import AutoTokenizer, TFAutoModel\nfrom transformers.utils.logging import set_verbosity_error\n\nfrom tftokenizers.file import (\n    get_filename_from_path,\n    get_vocab_from_path,\n    load_json\n)\nfrom tftokenizers.model import TFModel\nfrom tftokenizers.tokenizer import TFAutoTokenizer, TFTokenizerBase\n\nset_verbosity_error()\ntf.get_logger().setLevel("ERROR")\n\npretrained_model_name = "bert-base-cased"\n\n\n# a) by model_name\ntf_tokenizer = TFAutoTokenizer.from_pretrained(pretrained_model_name)\n\n# b) bundled with the model, similar to TFHub\nmodel = TFAutoModel.from_pretrained(pretrained_model_name)\ncustom_model = TFModel(model=model, tokenizer=tf_tokenizer)\n\n# c) from source, using the saved files of a transformers tokenizer\n# Make sure you run download.py or the download script first\nPATH = "saved_tokenizers/bert-base-uncased"\nvocab = get_vocab_from_path(PATH)\nvocab_path = get_filename_from_path(PATH, "vocab")\n\nconfig = load_json(f"{PATH}/tokenizer_config.json")\ntokenizer_spec = load_json(f"{PATH}/tokenizer.json")\nspecial_tokens_map = load_json(f"{PATH}/special_tokens_map.json")\n\ntokenizer_base_params = dict(lower_case=True, token_out_type=tf.int64)\ntokenizer_base = text.BertTokenizer(vocab_path, **tokenizer_base_params)\ncustom_tokenizer = TFTokenizerBase(\n    vocab_path=vocab_path,\n    tokenizer_base=tokenizer_base,\n    hf_spec=tokenizer_spec,\n    config=config,\n)\n```\n</details>\n\n\n### How to save Huggingface Tokenizer files locally?\n<details>\n<summary>:memo: Read More:</summary>\n\nTo download the files used by Huggingface tokenizers, you can either download one by name\n```\npython tftokenizers/download.py -n KB/bert-base-swedish-cased\n```\nor download multiple\n```\nbash scrips/download_tokenizers.sh\n```\n</details>\n\n## WIP\n\n- [x] Convert a BERT tokenizer from Huggingface to Tensorflow\n- [x] Make a TF Reusabel SavedModel with Tokenizer and Model in the same class. Emulate how the TF Hub example for BERT works.\n- [x] Find methods for identifying the base tokenizer model and map those settings and special tokens to new tokenizers\n- [x] Extend the tokenizers to more tokenizer types and identify them from a huggingface model name\n- [x] Document how others can use the library and document the different stages in the process\n- [x] Improve the conversion pipeline (s.a. Download and export files if not passed in or available locally)\n- [ ] `model_max_length` should be regulated. However, some newer models have the max_lenght for tokenizers at 1000_000_000\n- [ ] Support more tokenizers, starting with SentencePiece\n- [ ] Identify tokenizer conversion limitations\n- [ ] Support encoding of two sentences at a time [Ref](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)\n- [ ] Allow the tokenizers to be used for Masking (MLM) [Ref](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)\n',
    'author': 'MarkusSagen',
    'author_email': 'markus.john.sagen@gmail.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/Hugging-Face-Supporter/tftokenizers',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.8,<4.0',
}


setup(**setup_kwargs)
