Metadata-Version: 2.1
Name: etl-toolbox
Version: 0.0.3
Summary: Useful ETL functions for Python
Home-page: https://etl-toolbox.readthedocs.io
Author: Brooklyn Rose Ludlow
License: Apache-2.0
Project-URL: Source Code, https://github.com/brookcub/etl-toolbox
Project-URL: Bug Reports, https://github.com/brookcub/etl-toolbox/issues
Project-URL: Documentation, https://etl-toolbox.readthedocs.io
Description: etl-toolbox
        ===========
        
           Useful ETL functions for Python
        
        **etl-toolbox** is a Python library of simple but powerful functions for ETL and
        data cleaning. It contains tools that are useful for nearly any ETL pipeline,
        with a specific focus on the data variety challenges that arise when compiling
        data from many sources.
        
        .. image:: https://img.shields.io/github/workflow/status/brookcub/etl-toolbox/Tests?logo=github
           :alt: GitHub Build Status
           :target: https://github.com/brookcub/etl-toolbox/actions?query=workflow%3ATests+branch%3Amaster
        
        .. image:: https://codecov.io/gh/brookcub/etl-toolbox/branch/master/graph/badge.svg
           :alt: Coverage
           :target: https://codecov.io/gh/brookcub/etl-toolbox
        
        .. image:: https://img.shields.io/readthedocs/etl-toolbox?label=docs&logo=Read%20the%20Docs
           :alt: Read the Docs
           :target: https://etl-toolbox.readthedocs.io/en/latest/
        
        .. image:: https://img.shields.io/pypi/v/etl-toolbox?color=ffa313
           :alt: PyPI Version
           :target: https://pypi.python.org/pypi/etl-toolbox
        
        .. image:: https://img.shields.io/pypi/pyversions/etl-toolbox?color=0882ac
           :alt: Supported Python versions
           :target: https://pypi.python.org/pypi/etl-toolbox
        
        .. image:: https://img.shields.io/pypi/l/etl-toolbox?color=0882ac
           :alt: License
           :target: https://pypi.python.org/pypi/etl-toolbox
        
        Features
        --------
        
        - Standardize various null-indicating values (``'blank'``, ``'none'``, ``'null'``, etc) to Python's ``None``
        - Trim, condense, and standardize whitespace with a single function
        - Locate and rename column labels in messy files
        
        Quick Start
        -----------
        
        Installation
        ~~~~~~~~~~~~
        
        .. _PyPI: http://pypi.python.org/pypi/etl-toolbox/
        .. _pip: https://pip.pypa.io/en/stable/quickstart/
        
        Install from PyPI_ using pip_::
        
              $ pip install etl_toolbox
        
        Usage
        ~~~~~
        
        >>> import pandas as pd
        >>>
        >>> df = pd.read_csv('./test_data/bad-data.csv')
        >>> df  # doctest:+SKIP
                 Unnamed: 0           Unnamed: 1            Unnamed: 2    Unnamed: 3 Unnamed: 4
        0      created by:   Brookcub Industries  for testing purposes           NaN        NaN
        1             date:           2020-06-07                3 rows  some columns        NaN
        2               NaN                  NaN                   NaN           NaN        NaN
        3             Cust.             EML-addr                On          phn-nmbr       col5
        4     Golden jackal    c.aureus@mail.com              03/04/14      333-4444      blank
        5  Pie, rufous tree                 none                   NaN      222-3333      empty
        6   Vulture, bengal              blocked              06/01/15      777-7777       none
        7       Arctic tern  s_paradise@mail.com              01/28/16           NaN        NaN
        8   Eurasian badger  meles@othermail.net         notavailable            NaN        NaN
        9   Grant's gazelle   grant@randmail.com                     -           NaN        NaN
        
        Find and standardize column labels using a dictionary of the expected values:
        
        >>> from etl_toolbox.dataframe_functions import find_column_labels
        >>> from etl_toolbox.mapping_functions import map_labels
        >>>
        >>> fingerprint_map = {
        ...     'cust': 'Name',
        ...     'emladdr': 'Email',
        ...     'on': 'Date',
        ...     'phnnmbr': 'Phone'
        ... }
        >>>
        >>> find_column_labels(df, fingerprint_map)
        >>> df
                      Cust.             EML-addr         On      phn-nmbr   col5
        0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
        1  Pie, rufous tree                 none            NaN  222-3333  empty
        2   Vulture, bengal              blocked       06/01/15  777-7777   none
        3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
        4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
        5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
        >>>
        >>> df.columns = map_labels(df.columns, fingerprint_map)
        >>> df
                       Name                Email           Date     Phone      -
        0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
        1  Pie, rufous tree                 none            NaN  222-3333  empty
        2   Vulture, bengal              blocked       06/01/15  777-7777   none
        3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
        4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
        5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
        
        Standardize null values and remove empty rows/columns:
        
        >>> from etl_toolbox.dataframe_functions import dataframe_clean_null
        >>>
        >>> dataframe_clean_null(df)
        >>> df
                       Name                Email      Date     Phone
        0     Golden jackal    c.aureus@mail.com  03/04/14  333-4444
        1  Pie, rufous tree                  NaN       NaN  222-3333
        2   Vulture, bengal                  NaN  06/01/15  777-7777
        3       Arctic tern  s_paradise@mail.com  01/28/16       NaN
        4   Eurasian badger  meles@othermail.net       NaN       NaN
        5   Grant's gazelle   grant@randmail.com       NaN       NaN
        
        Or clean individual data values:
        
        >>> from etl_toolbox.cleaning_functions import clean_whitespace
        >>>
        >>> clean_whitespace(''' 123   abc 456
        ...                               def\t\t 789\t''')
        '123 abc 456 def 789'
        
        .. docs-exclusion-marker-start
        
        Documentation
        -------------
        
        Full documentation is hosted at `etl-toolbox.readthedocs.io <https://etl-toolbox.readthedocs.io>`_.
        
        .. docs-exclusion-marker-end
        
        Contributing
        ------------
        
        .. _Open an issue: https://github.com/brookcub/etl-toolbox/issues/new
        .. _issue tracker: https://github.com/brookcub/etl-toolbox/issues
        .. _this repository: https://github.com/brookcub/etl-toolbox/
        .. _the Stack Overflow guide: https://stackoverflow.com/help/minimal-reproducible-example
        .. _PEP8: https://www.python.org/dev/peps/pep-0008/
        .. _Flake8: https://flake8.pycqa.org/en/latest/
        
        Contributions are appreciated! There are multiple ways to contribute:
        
        Bug Reports
        ~~~~~~~~~~~
        
        Bug reports help make this library more robust. A good bug report should include:
        
        #. A clear description of the problem (the *expected* behavior vs the *actual* behavior)
        #. A minimal, reproducible example (see `the Stack Overflow guide`_)
        #. The platform and versions involved (operating system, Python version, ``etl-toolbox`` version, ``pandas``/``numpy`` version if applicable, etc)
        
        Submit bug reports with the `issue tracker`_ on GitHub.
        
        Feature Requests
        ~~~~~~~~~~~~~~~~
        
        `Open an issue`_ to discuss features you'd like to see added to etl-toolbox.
        
        Pull Requests
        ~~~~~~~~~~~~~
        
        Follow these steps for submitting pull requests:
        
        #. Find an issue or feature on the `issue tracker`_.
        #. Fork `this repository`_ on GitHub and make changes in a branch.
        #. Add tests to confirm that the bugfix/feature works as expected.
        #. Run the entire test suite and coverage report with ``pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py``. Make sure text coverage is 100% and all tests are passing.
        #. Submit a pull request.
        
        The code style for etl-toolbox mostly follows PEP8_. A linter like Flake8_ is recommended for double checking new contributions.
        
        Release History
        ---------------
        
        -  0.0.3
        
           -  Fix multiple bugs in ``merge_columns_by_label()`` that occurred with certain inputs
           -  Change ``merge_columns_by_label()`` to remove ``None`` and ``np.nan`` values from merged columns
           -  Change ``find_column_labels()`` to check whether the existing column labels fit the match criteria before searching rows
           -  Change ``map_labels()`` to return ``'-'`` instead of ``None`` for unmapped labels
           -  Change ``clean_whitespace()`` to return non-string inputs unaltered instead of raising an exception
        
        -  0.0.2
        
           -  Add GitHub continuous integration
           -  Add project links and badges to readme and PyPI metadata
           -  Fix bug in ``merge_columns_by_label()`` that raises a ``ValueError`` if ``df`` has multiple columns labeled ``None``
        
        -  0.0.1
        
           -  Initial release
        
Keywords: etl pandas data cleaning
Platform: any
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Development Status :: 4 - Beta
Classifier: Natural Language :: English
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.5
Description-Content-Type: text/x-rst
Provides-Extra: testing
