Metadata-Version: 2.1
Name: AutoDataCleaner
Version: 1.1.3
Summary: Simple and automatic data cleaning in one line of code! It performs one-hot encoding, date & time casting to datetime dtype, detects binary columns, safely convert non-numeric columns to numeric dtypes, cleaning dirty/empty values, normalizing values and removing unwanted columns all in one line of code. Get your data ready for model training and fitting quickly.
Home-page: https://github.com/sinkingtitanic/AutoDataCleaner
Author: Sinking Titanic
Author-email: ofcourse7878@gmail.com
License: UNKNOWN
Description: # AutoDataCleaner
        
        [![version](https://img.shields.io/badge/Version-1.1.3-lightgrey)](https://github.com/sinkingtitanic/AutoDataCleaner)
        [![build](https://img.shields.io/badge/Pypi%20Build-Stable-blue)](https://pypi.org/project/AutoDataCleaner/)
        [![python-version](https://img.shields.io/badge/Python-3^-success)](https://www.python.org/downloads/)
        [![coverage](https://img.shields.io/badge/coverage-%25100-success)](https://pypi.org/project/AutoDataCleaner/)
        
        ![Preview](https://raw.githubusercontent.com/sinkingtitanic/AutoDataCleaner/main/images/autodatacleaner.png)
        
        Simple and automatic data cleaning in one line of code! It performs  **one-hot encoding**, **converts columns to numeric dtype**, **cleaning dirty/empty values**, **normalizes values** and **removes unwanted columns** all in one line of code.
        Get your data ready for model training and fitting quickly.
        # Features 
        0. **Uses Pandas DataFrames** (no need to learn new syntax)
        1. **One-hot encoding**: encodes non-numeric values to one-hot encoding columns 
        2. **Converts columns to numeric dtypes**: converts text numbers to numeric dtypes **see [1] below**
        3. **Auto detects binary columns**: any column that has two unique values, these values will be replaced with 0 and 1 (e.g.: `['looser', 'winner'] => [0,1]`)
        4. **Normalization**: performs normalization to columns (excludes binary [1/0] columns)
        5. **Cleans Dirty/None/NA/Empty values**: replace None values with mean or mode of a column, delete row that has None cell or substitute None values with pre-defined value
        6. **Delete Unwanted Columns**: drop and remove unwanted columns (usually this will be the 'id' column)
        7. **Converts date, time or datetime columns to datetime dtype**
        # Installation 
        #### Using pip
        `pip install AutoDataCleaner`
        #### Cloning repo: 
        Clone repository and run `pip install -e .` inside the repository directory
        #### Install from repo directly
        Install from repository directly using `pip install git+git://github.com/sinkingtitanic/AutoDataCleaner.git#egg=AutoDataCleaner`
        # Quick One-line Usage: 
        ```
            import AutoDataCleaner.AutoDataCleaner as adc
            adc.clean_me(dataframe, 
                    detect_binary=True, 
                    numeric_dtype=True, 
                    one_hot=True, 
                    na_cleaner_mode="mean", 
                    normalize=True, 
                    datetime_columns=[], 
                    remove_columns=[], 
                    verbose=True)
        ```
        # Example 
        ```
        >>> import pandas as pd
        >>> import AutoDataCleaner.AutoDataCleaner as adc
        >>> df = pd.DataFrame([
        ...     [1, "Male", "white", 3, "2018/11/20"], 
        ...     [2, "Female", "blue", "4", "2014/01/12"],
        ...     [3, "Male", "white", 15, "2020/09/02"], 
        ...     [4, "Male", "blue", "5", "2020/09/02"], 
        ...     [5, "Male", "green", None, "2020/12/30"]
        ...     ], columns=['id', 'gender', 'color', 'weight', 'created_on'])
        >>> 
        >>> adc.clean_me(df, 
        ...     detect_binary=True, 
        ...     numeric_dtype=True, 
        ...     one_hot=True, 
        ...     na_cleaner_mode="mode", 
        ...     normalize=True, 
        ...     datetime_columns=["created_on"], 
        ...     remove_columns=["id"], 
        ...     verbose=True)
         +++++++++++++++ AUTO DATA CLEANING STARTED ++++++++++++++++ 
         =  AutoDataCleaner: Casting datetime columns to datetime dtype... 
          + converted column created_on to datetime dtype
         =  AutoDataCleaner: Performing removal of unwanted columns... 
          + removed 1 columns successfully.
         =  AutoDataCleaner: Performing One-Hot encoding... 
          + detected 1 binary columns [['gender']], cells cleaned: 5 cells
         = AutoDataCleaner: Converting columns to numeric dtypes when possible...
          + 1 minority (minority means < %25 of 'weight' entries) values that cannot be converted to numeric dtype in column 'weight' have been set to NaN, nan cleaner function will deal with them
          + converted 5 cells to numeric dtypes
         =  AutoDataCleaner: Performing One-Hot encoding... 
          + one-hot encoding done, added 2 new columns
         =  AutoDataCleaner: Performing None/NA/Empty values cleaning... 
          + cleaned the following NaN values: {'weight NaN Values': 1}
         =  AutoDataCleaner: Performing dataset normalization... 
          + normalized 5 cells
         +++++++++++++++ AUTO DATA CLEANING FINISHED +++++++++++++++ 
           gender    weight created_on  color_blue  color_green  color_white
        0       1 -0.588348 2018-11-20           0            0            1
        1       0 -0.392232 2014-01-12           1            0            0
        2       1  1.765045 2020-09-02           0            0            1
        3       1 -0.196116 2020-09-02           1            0            0
        4       1 -0.588348 2020-12-30           0            1            0
        
        
        ```
        
        **If you want to pick and choose with more customization, please go to `AutoDataCleaner.py` (the code is highly documented for your convenience)**
        # Explaining Parameters 
        
        `adc.clean_me(dataframe, detect_binary=True, one_hot=True, na_cleaner_mode="mean", normalize=True, remove_columns=[], verbose=True)`
        
        Parameters & what do they mean
        _Call the help function `adc.help()` to output the below instructions_ 
        * `dataframe`: input Pandas DataFrame on which the cleaning will be performed <br />
        * `detect_binary`: if True, any column that has two unique values, these values will be replaced with 0 and 1 (e.g.: ['looser', 'winner'] => [0,1]) <br />
        * `numeric_dtype`: if True, columns will be converted to numeric dtypes when possible **see [1] below**
        * `one_hot`: if True, all non-numeric columns will be encoded to one-hot columns <br />
        * `na_cleaner_mode`: what technique to use when dealing with None/NA/Empty values. Modes: <br />
            * `False`: do not consider cleaning na values <br />
            * `'remove row'`: removes rows with a cell that has NA value<br />
            * `'mean'`: substitues empty NA cells with the mean of that column <br /> 
            * `'mode'`: substitues empty NA cells with the mode of that column<br />
            * `'*'`: any other value will substitute empty NA cells with that particular value passed here <br />
        * `normalize`: if True, all non-binray (columns with values 0 or 1 are excluded) columns will be normalized. <br />
        * `datetime_columns`: a list of columns which contains date or time or datetime entries (important to be announced in this list, otherwise `normalize_df` and `convert_numeric_df` functions  will mess up these columns)
        * `remove_columns`: list of columns to remove, this is usually non-related featues such as the ID column <br />
        * `verbose`: print progress in terminal/cmd<br />
        * `returns`: processed and clean Pandas DataFrame <br />
        
        [1] When `numeric_dtype` is set to True, columns that have strings of numbers (e.g.: "123" instead of 123) will be converted to numeric dtype. 
        if in a particular column, the values that cannot be converted to numeric dtypes are minority in that column (< 25% of total entries in that column), these
        minority non-numeric values in that column will be converted to NaN; then, the NaN cleaner function will handle them according to your settings. See `convert_numeric_df()` function in `AutoDataCleaner.py` file for more documentation.
        
        # Prediction 
        In prediction phase, put the examples to be predicted in Pandas DataFrame and run them through `adc.clean_me()` function **with the same parameters you
        used during training**.
        
        # Contribution & Maintenance
        This repository is seriously commented for your convenience; please feel free to send me feedback on "ofcourse7878@gmail.com", submit an issue or make a pull request! 
        
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Database
Description-Content-Type: text/markdown
