Metadata-Version: 2.1
Name: gruut
Version: 1.3.1
Summary: A tokenizer, text cleaner, and phonemizer for many human languages.
Home-page: https://github.com/rhasspy/gruut
Author: Michael Hansen
Author-email: mike@rhasspy.org
License: UNKNOWN
Description: # Gruut
        
        A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages.
        
        ```python
        from gruut import text_to_phonemes
        
        text = 'He wound it around the wound, saying "I read it was $10 to read."'
        
        for sent_idx, word, word_phonemes in text_to_phonemes(text, lang="en-us"):
            print(word, *word_phonemes)
        ```
        
        which outputs:
        
        ```
        he h ˈi
        wound w ˈaʊ n d
        it ˈɪ t
        around ɚ ˈaʊ n d
        the ð ə
        wound w ˈu n d
        , |
        saying s ˈeɪ ɪ ŋ
        i ˈaɪ
        read ɹ ˈɛ d
        it ˈɪ t
        was w ə z
        ten t ˈɛ n
        dollars d ˈɑ l ɚ z
        to t ə
        read ɹ ˈi d
        . ‖
        ```
        
        Note that "wound" and "read" have different pronunciations when used in different contexts.
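        
        To make that point concrete, here is a minimal sketch that collects the distinct pronunciations seen for each word. It does not call gruut: the `sample` rows are copied from the output above and stand in for the tuples yielded by `text_to_phonemes`. The `sent_idx` value of `0` and the helper name `collect_pronunciations` are assumptions made for illustration.
        
        ```python
        from collections import defaultdict
        
        # Rows copied from the example output above, shaped like the
        # (sent_idx, word, word_phonemes) tuples in the loop.
        # Assumption: the whole text is one sentence, so sent_idx is 0 throughout.
        sample = [
            (0, "he", ["h", "ˈi"]),
            (0, "wound", ["w", "ˈaʊ", "n", "d"]),
            (0, "it", ["ˈɪ", "t"]),
            (0, "around", ["ɚ", "ˈaʊ", "n", "d"]),
            (0, "the", ["ð", "ə"]),
            (0, "wound", ["w", "ˈu", "n", "d"]),
            (0, "read", ["ɹ", "ˈɛ", "d"]),
            (0, "it", ["ˈɪ", "t"]),
            (0, "to", ["t", "ə"]),
            (0, "read", ["ɹ", "ˈi", "d"]),
        ]
        
        
        def collect_pronunciations(rows):
            """Map each word to the distinct pronunciations seen for it, in order."""
            seen = defaultdict(list)
            for _sent_idx, word, phonemes in rows:
                pron = " ".join(phonemes)
                if pron not in seen[word]:
                    seen[word].append(pron)
            return dict(seen)
        
        
        pronunciations = collect_pronunciations(sample)
        
        # Keep only the words that were pronounced in more than one way.
        heteronyms = {w: p for w, p in pronunciations.items() if len(p) > 1}
        for word, prons in heteronyms.items():
            print(word, "->", " / ".join(prons))
        ```
        
        With the sample rows, only "wound" and "read" end up with two pronunciations each; "it" appears twice but is pronounced the same way both times.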
        
        See [the documentation](https://rhasspy.github.io/gruut/) for more details.
        
        ## Installation
        
        ```sh
        $ pip install gruut
        ```
        
        Additional languages can be added during installation. For example, with French and Italian support:
        
        ```sh
        $ pip install gruut[fr,it]
        ```
        
        You may also [manually download language files](https://github.com/rhasspy/gruut/releases/tag/v1.0.0) and use the `--lang-dir` option:
        
        ```sh
        $ gruut <lang> <command> --lang-dir /path/to/language-files/
        ```
        
        If you extract the files to `$HOME/.config/gruut/`, gruut will use them automatically: when the corresponding Python package is not installed, it looks for language files in `$HOME/.config/gruut/<lang>/`. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`.
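        
        The directory convention can be checked with a few lines of Python. This is a sketch only, not gruut's own lookup code: the helper `find_lang_dir` and the `FULL_NAMES` table are hypothetical, and the short-to-full name pairs are the ones listed under Supported Languages.
        
        ```python
        from pathlib import Path
        
        # Short codes and their full names, as listed under "Supported Languages".
        # Codes without a listed full form (e.g. "fa", "nl") are used unchanged.
        FULL_NAMES = {
            "cs": "cs-cz",
            "de": "de-de",
            "en": "en-us",
            "es": "es-es",
            "fr": "fr-fr",
            "it": "it-it",
            "ru": "ru-ru",
            "sv": "sv-se",
        }
        
        
        def find_lang_dir(lang, config_dir=None):
            """Return the language directory under config_dir, or None if it is absent.
        
            Hypothetical helper: it only mirrors the documented directory layout.
            """
            if config_dir is None:
                # Default location described above: $HOME/.config/gruut/
                config_dir = Path.home() / ".config" / "gruut"
            full_name = FULL_NAMES.get(lang, lang)
            lang_dir = Path(config_dir) / full_name
            return lang_dir if lang_dir.is_dir() else None
        
        
        print(find_lang_dir("de"))
        ```
        
        The call prints the path to `de-de` under the default directory if it exists, and `None` otherwise.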
        
        ## Supported Languages
        
        gruut currently supports:
        
        * Czech (`cs` or `cs-cz`)
        * German (`de` or `de-de`)
        * English (`en` or `en-us`)
        * Spanish (`es` or `es-es`)
        * Farsi/Persian (`fa`)
        * French (`fr` or `fr-fr`)
        * Italian (`it` or `it-it`)
        * Dutch (`nl`)
        * Russian (`ru` or `ru-ru`)
        * Swedish (`sv` or `sv-se`)
        
        The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).
        
        ## Dependencies
        
        * Python 3.6 or higher
        * Linux
            * Tested on Debian Buster
        * [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
            * Currency/number handling
            * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
        * gruut-ipa
            * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
        * [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
            * Part of speech tagging and grapheme to phoneme models
        
        
        ## Command-Line Usage
        
        The `gruut` module can be executed with `python3 -m gruut <LANGUAGE> <COMMAND> <ARGS>`.
        
        The commands are line-oriented, consuming/producing either text or [JSONL](https://jsonlines.org/).
        They can be composed to produce a pipeline for cleaning text.
        
        You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.
        
        ### tokenize
        
        Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.
        
        ```sh
        $ echo 'This, right here, is some RAW text!' \
            | python3 -m gruut en-us tokenize \
            | jq -c .clean_words
        ["this", ",", "right", "here", ",", "is", "some", "raw", "text", "!"]
        ```
        
        See `python3 -m gruut <LANGUAGE> tokenize --help` for more options.
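        
        If you would rather process the output in Python than with jq, the JSONL can be read line by line with the standard `json` module. The sketch below works on a sample line instead of running gruut. The sample is reduced to the `clean_words` field shown above (real output may contain other fields), and `iter_clean_words` is a hypothetical helper name.
        
        ```python
        import json
        
        # One JSONL line, reduced to the single field shown in the jq example above.
        # Assumption: real tokenize output may carry more fields than this.
        sample_jsonl = (
            '{"clean_words": ["this", ",", "right", "here", ",", '
            '"is", "some", "raw", "text", "!"]}\n'
        )
        
        
        def iter_clean_words(lines):
            """Yield the clean_words list from each non-empty JSONL line."""
            for line in lines:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                yield json.loads(line)["clean_words"]
        
        
        for words in iter_clean_words(sample_jsonl.splitlines()):
            print(" ".join(words))
        ```
        
        To use it on real output, pass `sys.stdin` as `lines` and pipe the `tokenize` command into the script.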
        
        ### phonemize
        
        Takes [JSONL](https://jsonlines.org/) output from `tokenize` and produces [JSONL](https://jsonlines.org/) with phonemic pronunciations.
        
        ```sh
        $ echo 'This, right here, is some RAW text!' \
            | python3 -m gruut en-us tokenize \
            | python3 -m gruut en-us phonemize \
            | jq -r .pronunciation_text
        ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖
        ```
        
        See `python3 -m gruut <LANGUAGE> phonemize --help` for more options.
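        
        The pronunciation text is a space-separated phoneme string in which `|` and `‖` appear where the input had a comma and sentence-final punctuation. A minimal sketch for splitting such a string into phrases follows. Treating those two symbols as phrase separators is an assumption based on the example output, and `split_phrases` is a hypothetical helper, not part of gruut.
        
        ```python
        # Symbols treated as phrase boundaries (assumption based on the example output).
        BREAKS = {"|", "‖"}
        
        
        def split_phrases(pronunciation_text):
            """Split a space-separated phoneme string into phrases at break symbols."""
            phrases = []
            current = []
            for symbol in pronunciation_text.split():
                if symbol in BREAKS:
                    if current:  # close the phrase collected so far
                        phrases.append(current)
                        current = []
                else:
                    current.append(symbol)
            if current:  # text that does not end with a break symbol
                phrases.append(current)
            return phrases
        
        
        # String copied from the example output above.
        text = "ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖"
        for phrase in split_phrases(text):
            print(" ".join(phrase))
        ```
        
        For the example string this yields three phrases, one per stretch of text between punctuation marks.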
        
        ## Intended Audience
        
        gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
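        
        The lookup-then-guess strategy can be pictured with a toy example. This is an illustration under stated assumptions, not gruut's implementation: `toy_lexicon` holds a few entries copied from the example output at the top of this README, and `placeholder_guess` is a hypothetical stand-in for a grapheme-to-phoneme model that simply returns the word's letters.
        
        ```python
        # Toy lexicon with entries copied from the example output in this README.
        # A word may have more than one pronunciation.
        toy_lexicon = {
            "the": [["ð", "ə"]],
            "wound": [["w", "ˈaʊ", "n", "d"], ["w", "ˈu", "n", "d"]],
        }
        
        
        def placeholder_guess(word):
            """Hypothetical stand-in for a g2p model: one symbol per letter.
        
            The result is NOT a real phonemic transcription.
            """
            return [list(word)]
        
        
        def pronounce(word, lexicon, guess):
            """Return (pronunciations, source), preferring the lexicon over the guesser."""
            if word in lexicon:
                return lexicon[word], "lexicon"
            return guess(word), "guessed"
        
        
        for word in ["the", "wound", "gruut"]:
            prons, source = pronounce(word, toy_lexicon, placeholder_guess)
            print(word, source, [" ".join(p) for p in prons])
        ```
        
        Words found in the lexicon keep all of their listed pronunciations, and only unknown words fall through to the guesser.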
        
        For each supported language, gruut includes:
        
        * A word pronunciation lexicon built from open source data
            * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
        * A pre-trained grapheme-to-phoneme model for guessing word pronunciations
        
        Some languages also include:
        
        * A pre-trained part of speech tagger built from open source data
            * See [universal dependencies](https://universaldependencies.org/)
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: es
Provides-Extra: fa
Provides-Extra: sw
Provides-Extra: align
Provides-Extra: de
Provides-Extra: all
Provides-Extra: train
Provides-Extra: it
Provides-Extra: nl
Provides-Extra: pt
Provides-Extra: ar
Provides-Extra: ru
Provides-Extra: sv
Provides-Extra: fr
Provides-Extra: cs
