Metadata-Version: 2.1
Name: pycantonese
Version: 3.1.0.dev2
Summary: PyCantonese: Cantonese Linguistics and NLP in Python
Home-page: https://pycantonese.org
Author: Jackson L. Lee
Author-email: jacksonlunlee@gmail.com
License: MIT License
Download-URL: https://pypi.org/project/pycantonese/#files
Project-URL: Bug Tracker, https://github.com/jacksonllee/pycantonese/issues
Project-URL: Source Code, https://github.com/jacksonllee/pycantonese
Description: PyCantonese: Cantonese Linguistics and NLP in Python
        ====================================================
        
        
        
        Full Documentation: https://pycantonese.org
        
        |
        
        .. image:: https://badge.fury.io/py/pycantonese.svg
           :target: https://pypi.python.org/pypi/pycantonese
           :alt: PyPI version
        
        .. image:: https://img.shields.io/pypi/pyversions/pycantonese.svg
           :target: https://pypi.python.org/pypi/pycantonese
           :alt: Supported Python versions
        
        .. image:: https://circleci.com/gh/jacksonllee/pycantonese/tree/master.svg?style=svg
           :target: https://circleci.com/gh/jacksonllee/pycantonese/tree/master
           :alt: Build
        
        |
        
        .. start-sphinx-website-index-page
        
        PyCantonese is a Python library for Cantonese linguistics and natural language
        processing (NLP).
        The goal is to provide general-purpose tools to work with Cantonese language data:
        
        - Accessing and searching corpus data
        - Parsing and conversion tools for Jyutping romanization
        - Stop words
        - Word segmentation
        - Part-of-speech tagging
        
        Quick Examples
        --------------
        
        With PyCantonese imported:
        
        .. code-block:: python
        
            >>> import pycantonese as pc
        
        1. Word segmentation
        
        .. code-block:: python
        
            >>> pc.segment("廣東話好難學？")  # Is Cantonese difficult to learn?
            ['廣東話', '好', '難', '學', '？']
        
        2. Conversion from Cantonese characters to Jyutping
        
        .. code-block:: python
        
            >>> pc.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
            [('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
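        
        The result is a list of (word, Jyutping) pairs, so ordinary list
        processing applies. The following is a minimal sketch in plain Python;
        the ``pairs`` literal is copied from the output above rather than
        computed, and the variable names are illustrative only:
        
        .. code-block:: python
        
            # Pairs copied from the example output above (not computed here).
            pairs = [
                ("香港人", "hoeng1gong2jan4"),
                ("講", "gong2"),
                ("廣東話", "gwong2dung1waa2"),
            ]
        
            # One space-separated line of Jyutping for the whole sentence.
            romanized = " ".join(jyutping for _, jyutping in pairs)
        
            # The original text, rebuilt from the segmented words.
            original = "".join(word for word, _ in pairs)
        
            print(romanized)
            print(original)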
        
        3. Finding all verbs in the HKCanCor corpus
        
           In this example,
           we use the regular expression ``'^V'`` to find all words whose
           part-of-speech tag begins with "V" in the original HKCanCor annotations:
        
        .. code-block:: python
        
            >>> corpus = pc.hkcancor()  # get HKCanCor
            >>> all_verbs = corpus.search(pos='^V')
            >>> len(all_verbs)  # number of all verbs
            29012
            >>> from pprint import pprint
            >>> pprint(all_verbs[:10])  # print 10 results
            [('去', 'V', 'heoi3', ''),
             ('去', 'V', 'heoi3', ''),
             ('旅行', 'VN', 'leoi5hang4', ''),
             ('有冇', 'V1', 'jau5mou5', ''),
             ('要', 'VU', 'jiu3', ''),
             ('有得', 'VU', 'jau5dak1', ''),
             ('冇得', 'VU', 'mou5dak1', ''),
             ('去', 'V', 'heoi3', ''),
             ('係', 'V', 'hai6', ''),
             ('係', 'V', 'hai6', '')]
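        
        Each search result is a tuple whose first three items are the word,
        its part-of-speech tag, and its Jyutping. As a hedged sketch of how
        such results might be summarized, the code below tallies tags and words
        with ``collections.Counter``. The ``results`` literal is copied from the
        ten results printed above, not fetched from the corpus:
        
        .. code-block:: python
        
            from collections import Counter
        
            # The ten results copied from the example output above.
            results = [
                ('去', 'V', 'heoi3', ''),
                ('去', 'V', 'heoi3', ''),
                ('旅行', 'VN', 'leoi5hang4', ''),
                ('有冇', 'V1', 'jau5mou5', ''),
                ('要', 'VU', 'jiu3', ''),
                ('有得', 'VU', 'jau5dak1', ''),
                ('冇得', 'VU', 'mou5dak1', ''),
                ('去', 'V', 'heoi3', ''),
                ('係', 'V', 'hai6', ''),
                ('係', 'V', 'hai6', ''),
            ]
        
            # Count how often each part-of-speech tag occurs (second item).
            tag_counts = Counter(result[1] for result in results)
        
            # Count how often each word occurs (first item).
            word_counts = Counter(result[0] for result in results)
        
            print(tag_counts.most_common())
            print(word_counts.most_common(2))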
        
        4. Parsing Jyutping for (onset, nucleus, coda, tone)
        
        .. code-block:: python
        
            >>> pc.parse_jyutping('gwong2dung1waa2')  # 廣東話
            [('gw', 'o', 'ng', '2'), ('d', 'u', 'ng', '1'), ('w', 'aa', '', '2')]
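        
        Because each parsed syllable is an (onset, nucleus, coda, tone) tuple,
        the parts can be recombined or inspected individually. This is a small
        sketch in plain Python; the ``parsed`` literal is copied from the output
        above, and the other names are illustrative only:
        
        .. code-block:: python
        
            # Parsed syllables copied from the example output above.
            parsed = [
                ('gw', 'o', 'ng', '2'),
                ('d', 'u', 'ng', '1'),
                ('w', 'aa', '', '2'),
            ]
        
            # Reassemble each syllable by concatenating its four parts.
            syllables = ["".join(parts) for parts in parsed]
        
            # Pull out just the tones, or just the rhymes (nucleus + coda).
            tones = [tone for _, _, _, tone in parsed]
            rhymes = [nucleus + coda for _, nucleus, coda, _ in parsed]
        
            print(syllables)
            print(tones)
            print(rhymes)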
        
        Download and Install
        --------------------
        
        PyCantonese requires Python 3.6 or above.
        To download and install the stable, most recent version::
        
            $ pip install --upgrade pycantonese
        
        Bug fixes and new features not yet available in a released version
        are documented under the "Unreleased" section of the changelog.
        To get this development version of PyCantonese (possibly unstable),
        install directly from the source code hosted on GitHub:
        
        1. If you haven't done so already, install `Git LFS <https://git-lfs.github.com/>`_
           on your system. You only have to do this step once per system.
           Git LFS is needed to properly fetch the model files, which are stored
           differently because of their size and/or binary nature.
        
        2. Download and install PyCantonese from the GitHub source:
        
           .. code-block:: bash
        
               $ pip install git+https://github.com/jacksonllee/pycantonese.git@master#egg=pycantonese
        
        To test your installation in the Python interpreter:
        
        .. code-block:: python
        
            >>> import pycantonese as pc
            >>> pc.__version__  # show version number
        
        Links
        -----
        
        * Source code: https://github.com/jacksonllee/pycantonese
        * Bug tracker, feature requests: https://github.com/jacksonllee/pycantonese/issues
        * Email: Please contact `Jackson Lee <https://jacksonllee.com>`_.
        * Social media: Updates, tips, and more are posted on the PyCantonese Facebook page.
        
        
        
        |
        
        How to Cite
        -----------
        
        PyCantonese is authored and maintained by `Jackson L. Lee <https://jacksonllee.com>`_.
        
        A talk introducing PyCantonese:
        
        Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data.
        Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015.
        `Notes+slides <https://pycantonese.org/papers/Lee-pycantonese-2015.html>`_
        
        License
        -------
        
        MIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.
        
        The HKCanCor dataset included in PyCantonese is substantially modified from
        its source in terms of format. The original dataset has a CC BY license.
        Please see ``pycantonese/data/hkcancor/README.md``
        in the GitHub source code for details.
        
        The rime-cantonese data (release 2020.09.09) is
        incorporated into PyCantonese for word segmentation and
        characters-to-Jyutping conversion.
        This data has a CC BY 4.0 license.
        Please see ``pycantonese/data/rime_cantonese/README.md``
        in the GitHub source code for details.
        
        Acknowledgments
        ---------------
        
        Individuals who have contributed feedback, bug reports, etc.
        (in alphabetical order of last names if known):
        
        - @cathug
        - Litong Chen
        - @g-traveller
        - Rachel Han
        - Ryan Lai
        - Charles Lam
        - Hill Ma
        - @richielo
        - @rylanchiu
        - Stephan Stiller
        - Tsz-Him Tsui
        
        Logo design by albino.snowman (Instagram handle).
        
        .. end-sphinx-website-index-page
        
        Changelog
        ---------
        
        Please see ``CHANGELOG.md``.
        
        Setting up a Development Environment
        ------------------------------------
        
        The latest code under development is available on GitHub at
        `jacksonllee/pycantonese <https://github.com/jacksonllee/pycantonese>`_.
        You need to have `Git LFS <https://git-lfs.github.com/>`_ installed on your system.
        To obtain this version for experimental features or for development:
        
        .. code-block:: bash
        
           $ git clone https://github.com/jacksonllee/pycantonese.git
           $ cd pycantonese
           $ pip install -r requirements.txt
           $ pip install -e .
        
        To run tests and styling checks:
        
        .. code-block:: bash
        
           $ py.test -vv --cov pycantonese pycantonese
           $ flake8 pycantonese
           $ black --check --line-length=79 pycantonese
        
        To build the documentation website files:
        
        .. code-block:: bash
        
            $ python build_docs.py
Keywords: computational linguistics,natural language processing,NLP,Cantonese,linguistics,corpora,speech,language,Chinese,Jyutping
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Natural Language :: Cantonese
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Human Machine Interfaces
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
