Metadata-Version: 1.1
Name: neologdn
Version: 0.5.1
Summary: Japanese text normalizer for mecab-neologd
Home-page: http://github.com/ikegami-yukino/neologdn
Author: Yukino Ikegami
Author-email: yknikgm@gmail.com
License: Apache Software License
Description: neologdn
        ===========
        
        |travis| |pyversion| |version| |license|
        
        neologdn is a Japanese text normalizer for `mecab-neologd <https://github.com/neologd/mecab-ipadic-neologd>`_.
        
        Normalization follows the rules published by mecab-ipadic-neologd:
        https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
        
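        As a rough illustration, a few of these rules can be approximated with the standard ``re`` module. This is a sketch only, not neologdn's implementation: ``sketch_normalize`` is a hypothetical helper, and it covers just long-vowel shortening, hyphen unification, and tilde removal.
        
        .. code:: python
        
            import re
        
            def sketch_normalize(text):
                """Approximate three of the rules with regular expressions (illustrative only)."""
                # Collapse a run of prolonged sound marks into a single mark.
                text = re.sub("ー+", "ー", text)
                # Replace a run of hyphen look-alikes with one ASCII hyphen-minus.
                text = re.sub("[˗֊‐‑‒–⁃⁻₋−]+", "-", text)
                # Remove tilde variants entirely.
                text = re.sub("[~∼∾〜〰～]", "", text)
                return text
        
            sketch_normalize("ウェーーーーイ")
            # => 'ウェーイ'
        
        The real normalizer also handles width conversion and whitespace, which this sketch leaves out.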
        
        Contributions are welcome!
        
        NOTE: Installing this module requires a C++11 compiler.
        
        Installation
        ------------
        
        ::
        
         $ pip install neologdn
        
        Usage
        -----
        
        .. code:: python
        
            import neologdn
            neologdn.normalize("ﾊﾝｶｸｶﾅ")
            # => 'ハンカクカナ'
            neologdn.normalize("全角記号！？＠＃")
            # => '全角記号!?@#'
            neologdn.normalize("全角記号例外「・」")
            # => '全角記号例外「・」'
            neologdn.normalize("長音短縮ウェーーーーイ")
            # => '長音短縮ウェーイ'
            neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
            # => 'チルダ削除ウェイ'
            neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
            # => 'いろんなハイフン-'
            neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
            # => 'PRML副読本'
            neologdn.normalize(" Natural Language Processing ")
            # => 'Natural Language Processing'
            neologdn.normalize("かわいいいいいいいいい", repeat=6)
            # => 'かわいいいいいい'
            neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
            # => '無駄ァ'
            neologdn.normalize("1995〜2001年", tilde="normalize")
            # => '1995~2001年'
            neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
            # => '1995〜2001年'
            neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
            # => '1995〜2001年'
            neologdn.normalize("1995〜2001年", tilde="remove")
            # => '19952001年'
            neologdn.normalize("1995〜2001年")  # Default parameter
            # => '19952001年'
        
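        The behaviour of the ``repeat`` threshold can be sketched in pure Python. This is an illustration under assumptions, not the library's code: ``shorten_repeat_sketch`` is a hypothetical name, the regular expression is a stand-in technique, and ``repeat`` is assumed to be at least 1.
        
        .. code:: python
        
            import re
        
            def shorten_repeat_sketch(text, repeat):
                """Cap consecutive repetitions of any substring at `repeat` occurrences."""
                # (.+?) captures the shortest repeating unit; \1{repeat,} requires at least
                # `repeat` further copies, i.e. more than `repeat` occurrences in total.
                pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
                return pattern.sub(lambda m: m.group(1) * repeat, text)
        
            shorten_repeat_sketch("かわ" + "い" * 9, repeat=6)
            # => 'かわいいいいいい'
            shorten_repeat_sketch("無駄" * 4 + "ァ", repeat=1)
            # => '無駄ァ'
        
        Runs that are already at or below the threshold are left unchanged.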
        
        Benchmark
        ----------
        
        .. code:: python
        
            # Sample code from
            # https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
            import normalize_neologd
        
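            # NOTE: %timeit is an IPython magic, and normalize() is a benchmark driver
            # that is not defined in this snippet; see the notebook linked below.
            %timeit normalize(normalize_neologd.normalize_neologd)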
            # => 1 loop, best of 3: 18.3 s per loop
        
        
            import neologdn
            %timeit normalize(neologdn.normalize)
            # => 1 loop, best of 3: 9.05 s per loop
        
        
        neologdn is about twice as fast as the sample code (9.05 s versus 18.3 s).
        
        Details are in the notebook below:
        https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
        
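        Outside IPython, a comparable measurement can be sketched with the standard ``timeit`` module. The ``bench`` helper and the tiny corpus are hypothetical, and ``unicodedata.normalize`` is only a stand-in so that the snippet runs without neologdn installed.
        
        .. code:: python
        
            import timeit
            import unicodedata
        
            def bench(func, texts, number=10):
                """Return the best total time in seconds for `number` passes over `texts`."""
                timer = timeit.Timer(lambda: [func(t) for t in texts])
                # Take the minimum of three repeats, as %timeit reports the best run.
                return min(timer.repeat(repeat=3, number=number))
        
            # Hypothetical stand-in corpus and normalizer; swap in neologdn.normalize
            # and real text to compare implementations.
            texts = ["ﾊﾝｶｸｶﾅ", "全角記号！？＠＃"] * 100
            seconds = bench(lambda t: unicodedata.normalize("NFKC", t), texts)
            print("best of 3: %.4f s" % seconds)
        
        Timings from such a small corpus are only indicative.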
        
        License
        -------
        
        Apache Software License.
        
        
        Contribution
        ------------
        
        Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md
        
        
        .. |travis| image:: https://travis-ci.org/ikegami-yukino/neologdn.svg?branch=master
            :target: https://travis-ci.org/ikegami-yukino/neologdn
            :alt: travis-ci.org
        
        .. |version| image:: https://img.shields.io/pypi/v/neologdn.svg
            :target: http://pypi.python.org/pypi/neologdn/
            :alt: latest version
        
        .. |pyversion| image:: https://img.shields.io/pypi/pyversions/neologdn.svg
        
        .. |license| image:: https://img.shields.io/pypi/l/neologdn.svg
            :target: http://pypi.python.org/pypi/neologdn/
            :alt: license
        
        
        
        CHANGES
        ========
        
        0.5.1 (2021-05-02)
        ----------------------------
        
        - Improve performance of shorten_repeat function (Many thanks @yskn67)
        - Add tilde option to normalize function
        
        0.4 (2018-12-06)
        ----------------------------
        
        - Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ
        
        0.3.2 (2018-05-17)
        ----------------------------
        
        - Add option to suppress removal of spaces between Japanese characters
        
        0.2.2 (2018-03-10)
        ----------------------------
        
        - Fix bug (daku-ten & handaku-ten)
        - Support macOS 10.13 (Many thanks @r9y9)
        
        0.2.1 (2017-01-23)
        ----------------------------
        
        - Fix bug (Check whether the character preceding a daku-ten character is in maps) (Many thanks @unnonouno)
        
        0.2 (2016-04-12)
        ----------------------------
        
        - Add lengthened expression (repeating character) threshold
        
        0.1.2 (2016-03-29)
        ----------------------------
        
        - Fix installation bug
        
        0.1.1.1 (2016-03-19)
        ----------------------------
        
        - Support Windows
        - Explicitly specify -std=c++11 in build (Many thanks @id774)
        
        0.1.1 (2015-10-10)
        ----------------------------
        
        Initial release.
        
Keywords: japanese,MeCab
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: Japanese
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Text Processing :: Linguistic
