# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['fibberio']

package_data = \
{'': ['*']}

install_requires = \
['PyYAML>=6.0,<7.0',
 'click>=8.0.3,<9.0.0',
 'numpy>=1.22.2,<2.0.0',
 'pandas>=1.4.1,<2.0.0',
 'parsimonious>=0.8.1,<0.9.0']

entry_points = \
{'console_scripts': ['fibber = fibberio.cli:cli']}

setup_kwargs = {
    'name': 'fibberio',
    'version': '0.1.6',
    'description': 'Generate synthetic datasets for teaching machine learning concepts',
    'long_description': '# fibber\n\n(This is still under development)\n\nTeaching machine learning things is hard. The idea behind this library is to generate data in such a way that certain principles can be highlighted without resorting to "finding" the perfect dataset to do so.\n\nCurrently the library can be installed using `pip`:\n\n```\npip install fibberio\n```\n\nOnce the library is installed in your python environment, you can start generating data by:\n\n```\nfibber -t .\\tests\\data\\programmers.json -o .\\sandbox\\programmers.csv -c 10000\n```\n\nwhere `-t` is the Task Description file and `-o` is the output file. To specify the record count, the `-c` flag is used. Successfully running the command should show the following:\n\n```\nGenerating 10000 items using "programmers.json"\n-----------------------------------------------\n\n       FirstName  LastName           age  style          desc accept\ncount      10000     10000  10000.000000  10000  10000.000000  10000\nunique       966      1000           NaN      2           NaN      2\ntop         Remy  Anthony            NaN   tabs           NaN  False\nfreq          29        21           NaN   6642           NaN   5378\nmean         NaN       NaN     35.985700    NaN     21.736883    NaN\nstd          NaN       NaN      4.983832    NaN     10.526532    NaN\nmin          NaN       NaN     18.000000    NaN      5.010000    NaN\n25%          NaN       NaN     33.000000    NaN     12.580000    NaN\n50%          NaN       NaN     36.000000    NaN     20.070000    NaN\n75%          NaN       NaN     39.000000    NaN     34.660000    NaN\nmax          NaN       NaN     57.000000    NaN     36.800000    NaN\n\nSaving csv to C:\\projects\\fibberio\\sandbox\\programmers.csv\nTask complete\n```\n\nThe [programmers.json](./tests/data/programmers.json) file is a good starting point for understanding task descriptions.\n\n# Task Description\n\nThe best way to understand how it works is to look at a task 
description:\n\n```json\n{\n  "sources": {\n    "names": {\n      "path": "./full_names.csv",\n      "read_csv": {\n        "encoding": "unicode_escape",\n        "engine": "python"\n      }\n    }\n  },\n  "features": {\n    "FirstName": {\n      "source": {\n        "id": "names",\n        "target": "FirstName"\n      }\n    },\n    "age": {\n      "normal": {\n        "mean": 36,\n        "stddev": 5,\n        "precision": 0\n      }\n    },\n    "style": {\n      "discrete": {\n        "tabs": 2,\n        "spaces": 1\n      }\n    }\n  }\n}\n```\n\nA task description has two sections:\n\n1. **Sources** - external reference data\n2. **Features** - columns to generate\n\n## Sources\n\nThe `sources` section is a dictionary of references to external files whose data can be sampled later as features.\n\n```json\n{\n    "key": {\n        "path": "path_to_file",\n        "read_csv": {\n            "encoding": "unicode_escape",\n            "engine": "python"\n        }\n    }\n}\n```\n\nThe `key` is the identifier used to reference this data source later in the features. The `read_csv` entry names the pandas [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, and the enclosed dictionary holds the `**kwargs` passed to that function. 
In theory, any pandas call that loads a file can be used here (although at the time of writing, `read_csv` is the only one that has been tried).\n\nThe `path` to the data file (in the case above [./full_names.csv](./tests/data/full_names.csv)) is resolved *relative to the task description file* unless a full path is specified.\n\n## Features\n\nThe `features` section lists the features the system should generate along with their corresponding distributions:\n\n```json\n  "features": {\n    "FirstName": {\n      "source": {\n        "id": "names",\n        "target": "FirstName"\n      }\n    },\n    "age": {\n      "normal": {\n        "mean": 36,\n        "stddev": 5,\n        "precision": 0\n      }\n    },\n    "style": {\n      "discrete": {\n        "tabs": 2,\n        "spaces": 1\n      }\n    }\n  }\n```\n\nThis example defines three features:\n\n1. **FirstName** - this references the `names` source and samples from the `FirstName` column\n2. **age** - this samples from the `normal` distribution with three parameters passed to the `Normal` class as `**kwargs`\n3. **style** - this samples from a discrete distribution that will generate `tabs` and `spaces` in a 2 to 1 ratio\n\nThe standard definition for a feature therefore consists of:\n\n```json\n{\n  "feature_id": {\n    "distribution_class": {\n      [... distribution args ...]\n    },\n    "conditional": {\n      [... optional conditional feature generator ...]\n    }\n  }\n}\n```\n\nHere the `feature_id` is both the id of the feature and its column name (this can be overridden in certain samplers). 
The `distribution_class` is the name of a `Distribution` class, which is instantiated with the corresponding args.\n\nEssentially, if the Distribution class is instantiated by:\n\n```\ndistribution_class(prop1=2, prop2="seismic")\n```\n\nthen the corresponding `kwargs` should look like\n\n```\n{\n  "prop1": 2,\n  "prop2": "seismic"\n}\n```\n\nand the class is instantiated by\n\n```\ndistribution_class(**kwargs)\n```\n\nI am optimizing for readability as opposed to brevity. This requires the class to have an `__init__()` whose named parameters have defaults.\n\nThe optional `conditional` part of the feature is described next.\n\n## Conditionals\n\nFeature conditionals allow for conditional sampling based on the parent distribution. Here\'s an example:\n\n```json\n{\n  "age": {\n    "uniform": {\n        "low": 14,\n        "high": 85,\n        "itype": "float",\n        "precision": 2\n    },\n    "conditional": {\n      "score": {\n        "[14, 65)": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        },\n        "[65, *)": {\n          "normal": {\n            "mean": 35,\n            "stddev": 0.5\n          }\n        },\n        "*": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        }\n      }\n    }\n  }\n}\n```\n\nThis describes a `score` feature conditioned on the `age` feature. 
Since the parent distribution is continuous, the conditional subdivisions should be represented by ranges:\n\n- $[a, b]$ the closed interval ${ x \\in \\mathbb{R}: a \\le x \\le b }$\n- $[a, b)$ the half-open interval ${ x \\in \\mathbb{R}: a \\le x \\lt b }$\n- $(a, b]$ the half-open interval ${ x \\in \\mathbb{R}: a \\lt x \\le b }$\n- $(a, b)$ the open interval ${ x \\in \\mathbb{R}: a \\lt x \\lt b }$\n\nInside a range, `*` stands for an unbounded endpoint (as in `[65, *)`); on its own, `*` is the "catch-all". The ranges are processed in order and an exception is raised if none of them matches.\n\nThe task processes each top level feature and then passes the generated value to the conditional, which evaluates each range and generates from the distribution that "catches" the generated top level value.\n\nThe same applies to discrete probability distributions:\n\n```json\n{\n  "style": {\n    "discrete": {\n      "tabs": 234,\n      "spaces": 2332,\n      "agile": 21,\n      "scrum": 128\n    },\n    "conditional": {\n      "score": {\n        "tabs": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        },\n        "*": {\n          "normal": {\n            "mean": 12,\n            "stddev": 3\n          }\n        }\n      }\n    }\n  }\n}\n```\n\nIn this case, the conditional `score` feature will sample from the `uniform` distribution if "tabs" is generated for the `style` feature; otherwise the catch-all `*` will sample from the `normal` distribution.\n\nThese can be nested to any depth:\n\n```json\n{\n  "style": {\n    "discrete": {\n      "tabs": 234,\n      "spaces": 2332,\n      "agile": 21,\n      "scrum": 128\n    },\n    "conditional": {\n      "score": {\n        "tabs": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        },\n        "*": {\n          "normal": {\n            "mean": 12,\n           
 "stddev": 3\n          }\n        }\n      }\n    },\n    "conditional": {\n      "accepted": {\n        "[14, 65)": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        },\n        "[65, *)": {\n          "normal": {\n            "mean": 35,\n            "stddev": 0.5\n          }\n        },\n        "*": {\n          "uniform": {\n            "low": 5,\n            "high": 25,\n            "itype": "float",\n            "precision": 2\n          }\n        }\n      }\n    }\n  }\n}\n```\n\nNotice that in this case, the first conditional required discrete values while the second used ranges. An exception is raised if there is a mismatch.\n\nThe main idea is that every Feature has a `distribution` and an optional `conditional`.\n\n\n',
    'author': 'sethjuarez',
    'author_email': 'me@sethjuarez.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/sethjuarez/fibberio',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'entry_points': entry_points,
    'python_requires': '>=3.9,<4.0',
}


setup(**setup_kwargs)
