Metadata-Version: 2.1
Name: etl-toolbox
Version: 0.0.1
Summary: Useful ETL functions for Python
Home-page: https://github.com/brookcub/etl-toolbox
Author: Brooklyn Rose Ludlow
License: Apache Software License
Description: etl-toolbox
        ===========
        
           Useful ETL functions for Python
        
        **etl-toolbox** is a Python library of simple but powerful functions for ETL and
        data cleaning. It contains tools that are useful for nearly any ETL pipeline,
        with a specific focus on the data variety challenges that arise when compiling
        data from many sources.
        
        .. _PyPI: http://pypi.python.org/pypi/etl-toolbox/
        .. _pip: https://pip.pypa.io/en/stable/quickstart/
        .. _Open an issue: https://github.com/brookcub/etl-toolbox/issues/new
        .. _issue tracker: https://github.com/brookcub/etl-toolbox/issues
        .. _this repository: https://github.com/brookcub/etl-toolbox/
        .. _the Stack Overflow guide: https://stackoverflow.com/help/minimal-reproducible-example
        .. _PEP8: https://www.python.org/dev/peps/pep-0008/
        .. _Flake8: https://flake8.pycqa.org/en/latest/
        
        Features
        --------
        
        - Standardize various null-indicating values (``'blank'``, ``'none'``, ``'null'``, etc) to Python's ``None``
        - Trim, condense, and standardize whitespace with a single function
        - Locate and rename column labels in messy files
        
        Quick Start
        -----------
        
        Installation
        ~~~~~~~~~~~~
        
        Install from PyPI_ using pip_::
        
              $ pip install etl_toolbox
        
        Usage
        ~~~~~
        
        >>> import pandas as pd
        >>>
        >>> df = pd.read_csv('./test_data/bad-data.csv')
        >>> df  # doctest:+SKIP
                 Unnamed: 0           Unnamed: 1            Unnamed: 2    Unnamed: 3 Unnamed: 4
        0      created by:   Brookcub Industries  for testing purposes           NaN        NaN
        1             date:           2020-06-07                3 rows  some columns        NaN
        2               NaN                  NaN                   NaN           NaN        NaN
        3             Cust.             EML-addr                  \nOn      phn-nmbr       col5
        4     Golden jackal    c.aureus@mail.com              03/04/14      333-4444      blank
        5  Pie, rufous tree                 none                   NaN      222-3333      empty
        6   Vulture, bengal              blocked              06/01/15      777-7777       none
        7       Arctic tern  s_paradise@mail.com              01/28/16           NaN        NaN
        8   Eurasian badger  meles@othermail.net         notavailable            NaN        NaN
        9   Grant's gazelle   grant@randmail.com                     -           NaN        NaN
        
        Find and standardize column labels using a dictionary of the expected values:
        
        >>> from etl_toolbox.dataframe_functions import find_column_labels
        >>> from etl_toolbox.mapping_functions import map_labels
        >>>
        >>> fingerprint_map = {
        ...     'cust': 'Name',
        ...     'emladdr': 'Email',
        ...     'on': 'Date',
        ...     'phnnmbr': 'Phone'
        ... }
        >>>
        >>> find_column_labels(df, fingerprint_map)
        >>> df
                      Cust.             EML-addr           \nOn  phn-nmbr   col5
        0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
        1  Pie, rufous tree                 none            NaN  222-3333  empty
        2   Vulture, bengal              blocked       06/01/15  777-7777   none
        3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
        4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
        5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
        >>>
        >>> df.columns = map_labels(df.columns, fingerprint_map)
        >>> df
                       Name                Email           Date     Phone    NaN
        0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
        1  Pie, rufous tree                 none            NaN  222-3333  empty
        2   Vulture, bengal              blocked       06/01/15  777-7777   none
        3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
        4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
        5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
        
        Standardize null values and remove empty rows/columns:
        
        >>> from etl_toolbox.dataframe_functions import dataframe_clean_null
        >>>
        >>> dataframe_clean_null(df)
        >>> df
                       Name                Email      Date     Phone
        0     Golden jackal    c.aureus@mail.com  03/04/14  333-4444
        1  Pie, rufous tree                  NaN       NaN  222-3333
        2   Vulture, bengal                  NaN  06/01/15  777-7777
        3       Arctic tern  s_paradise@mail.com  01/28/16       NaN
        4   Eurasian badger  meles@othermail.net       NaN       NaN
        5   Grant's gazelle   grant@randmail.com       NaN       NaN
        
        Or clean individual data values:
        
        >>> from etl_toolbox.cleaning_functions import clean_whitespace
        >>>
        >>> clean_whitespace(''' 123   abc 456
        ...                               def\t\t 789\t''')
        '123 abc 456 def 789'
        
        Contributing
        ------------
        
        Contributions are appreciated! There are multiple ways to contribute:
        
        Bug Reports
        ~~~~~~~~~~~
        
        Bug reports help make this library more robust. A good bug report should include:
        
        #. A clear description of the problem (the *expected* behavior vs the *actual* behavior)
        #. A minimal, reproducible example (see `the Stack Overflow guide`_)
        #. The platform and versions involved (operating system, Python version, ``etl-toolbox`` version, ``pandas``/``numpy`` version if applicable, etc)
        
        Submit bug reports with the `issue tracker`_ on GitHub.
        
        Feature Requests
        ~~~~~~~~~~~~~~~~
        
        `Open an issue`_ to discuss features you'd like to see added to etl-toolbox.
        
        Pull Requests
        ~~~~~~~~~~~~~
        
        Follow these steps for submitting pull requests:
        
        #. Find an issue or feature on the `issue tracker`_.
        #. Fork `this repository`_ on GitHub and make changes in a branch.
        #. Add tests to confirm that the bugfix/feature works as expected.
        #. Run the entire test suite and coverage report with ``pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox``. Make sure text coverage is 100% and all tests are passing.
        #. Submit a pull request.
        
        The code style for etl-toolbox mostly follows PEP8_. A linter like Flake8_ is recommended for double checking new contributions.
        
        Release History
        ---------------
        
        -  0.0.1
        
           -  Work in progress
        
Keywords: etl pandas data cleaning
Platform: any
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 4 - Beta
Classifier: Natural Language :: English
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.5
Description-Content-Type: text/x-rst
Provides-Extra: testing
