Metadata-Version: 2.1
Name: hmni
Version: 0.1.8
Summary: Fuzzy Name Matching with Machine Learning
Home-page: https://github.com/Christopher-Thornton/hmni
Author: Christopher Thornton
Author-email: christopher_thornton@outlook.com
License: MIT
Download-URL: https://github.com/Christopher-Thornton/hmni/archive/v0.1.8.zip
Description: <p align="center">
          <img src="https://github.com/Christopher-Thornton/hmni/blob/master/nametag.png?raw=true" alt="logo" />
        </p>
        
        # HMNI
        
        ![GitHub](https://img.shields.io/github/license/Christopher-Thornton/hmni)
        ![PyPI](https://img.shields.io/pypi/v/hmni)
        ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hmni)
        [![Documentation Status](https://readthedocs.org/projects/hmni/badge/?version=latest)](https://hmni.readthedocs.io/en/latest/?badge=latest)
        ![PyPI - Downloads](https://img.shields.io/pypi/dm/hmni)
        ![GitHub repo size](https://img.shields.io/github/repo-size/Christopher-Thornton/hmni)
        
        Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.
        
        HMNI is trained on an internationally-transliterated Latin firstname dataset, where precision is afforded priority.
        
        |    Model    |  Accuracy | Precision |   Recall  |  F1-Score 
        |-------------|-----------|-----------|-----------|-----------
        | HMNI-Latin  | 0.9393    | 0.9255    | 0.7548    | 0.8315    
        
        For an introduction to the methodology and research behind HMNI, please refer to my [blog post](https://towardsdatascience.com/fuzzy-name-matching-with-machine-learning-f09895dce7b4).
        
        ## Requirements
        ### Python 3.5–3.8
        -  tensorflow
        -  scikit-learn
        -  fuzzywuzzy
        -  abydos
        -  unidecode
        
        ## QUICK USAGE GUIDE
        ## Installation
        Using PIP via PyPI
        ```bash
        pip install hmni
        ```
        #### Initialize a Matcher Object
        ```python
        import hmni
        matcher = hmni.Matcher(model='latin')
        ```
        #### Single Pair Similarity
        ```python
        matcher.similarity('Alan', 'Al')
        # 0.6838303319889133
        
        matcher.similarity('Alan', 'Al', prob=False)
        # 1
        
        matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
        # 0.6838303319889133
        ```
        #### Record Linkage
        ```python
        import pandas as pd
        
        df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
        df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
        
        merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
        ```
        #### Name Deduplication and Normalization
        ```python
        names_list = ['Alan', 'Al', 'Al', 'James']
        
        matcher.dedupe(names_list, keep='longest')
        # ['Alan', 'James']
        
        matcher.dedupe(names_list, keep='frequent')
        # ['Al, 'James']
        
        matcher.dedupe(names_list, keep='longest', replace=True)
        # ['Alan, 'Alan', 'Alan', 'James']
        ```
        ## Matcher Parameters
        > **hmni.Matcher**(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)
        * **model** *(str)* -- HMNI statistical model (latin by default)
        * **prefilter** *(bool)* -- Should the matcher prefilter unlikely candidates (True by default)
        * **allow_alt_surname** *(bool)* -- Should the matcher consider phonetic matching surnames *e.g. Smith, Schmidt* (True by default)
        * **allow_initials** *(bool)* -- Should the matcher consider names with initials (True by default)
        * **allow_missing_components** *(bool)* -- Should the matcher consider names with missing components (True by default)
        
        ## Matcher Methods
        > **similarity**(name_a, name_b, prob=True, surname_first=False)
        * **name_a** *(str)* -- First name for comparison
        * **name_b** *(str)* -- Second name for comparison
        * **prob** *(bool)* -- If True return a predicted probability, else binary class label
        * **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
        * **surname_first** *(bool)* -- If name strings start with surname (False by default)
        
        > **fuzzymerge**(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)
        * **df1** *(pandas DataFrame or named Series)* -- First/Left object to merge with
        * **df2** *(pandas DataFrame or named Series)* -- Second/Right object to merge with
        * **how** *(str)* -- Type of merge to be performed
            * `inner` (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
            * `left`: Use only keys from left frame, similar to a SQL left outer join; preserve key order
            * `right`: Use only keys from right frame, similar to a SQL right outer join; preserve key order
            * `outer`: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
        * **on** *(label or list)* -- Column or index level names to join on. These must be found in both DataFrames
        * **left_on** *(label or list)* -- Column or index level names to join on in the left DataFrame
        * **right_on** *(label or list)* -- Column or index level names to join on in the right DataFrame
        * **indicator** *(bool)* -- If True, adds a column to output DataFrame called “_merge” with information on the source of each row (False by default)
        * **limit** *(int)* -- Top number of name matches to consider (1 by default)     
        * **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)       
        * **allow_exact_matches** *(bool)* -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
        * **surname_first** *(bool)* -- If name strings start with surname (False by default)
        
        > **dedupe**(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)
        * **names** *(list)* -- List of names to dedupe
        * **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
        * **keep** *(str)* -- Specifies method for keeping one of multiple alternative names 
            * `longest` (default): Keeps longest name
            * `frequent`: Keeps most frequent name in names list
        * **reverse** *(bool)* -- If True will sort matches descending order, else ascending (True by default)
        * **limit** *(int)* -- Top number of name matches to consider (3 by default)
        * **replace** *(bool)* -- If True return normalized name list, else return deduplicated name list (False by default) 
        * **surname_first** *(bool)* -- If name strings start with surname (False by default)
        
        > **assign_similarity**(name_a, name_b, score)
        * **name_a** *(str)* -- First name for similarity score assignment
        * **name_b** *(str)* -- Second name for similarity score assignment
        * **score** *(float)* -- Assigned similarity score for pair of names
        
        ## Contributing
        Pull requests are welcome. 
        For developers wishing to build a model using Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic), 
        jupyter notebooks are shared in the `dev` folder to build models using similar methods. 
        
        ## License
        [MIT](https://choosealicense.com/licenses/mit/)
        
Keywords: fuzzy-matching,natural-language-processing,nlp,machine-learning,data-science,python,artificial-intelligence,ai
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
