Metadata-Version: 2.1
Name: sdmetrics
Version: 0.5.1.dev0
Summary: Metrics for Synthetic Data Generation Projects
Home-page: https://github.com/sdv-dev/SDMetrics
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Description: <div align="center">
        <br/>
        <p align="center">
            <i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
        </p>
        
        [![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
        [![PyPI Shield](https://img.shields.io/pypi/v/sdmetrics.svg)](https://pypi.python.org/pypi/sdmetrics)
        [![Downloads](https://pepy.tech/badge/sdmetrics)](https://pepy.tech/project/sdmetrics)
        [![Tests](https://github.com/sdv-dev/SDMetrics/workflows/Run%20Tests/badge.svg)](https://github.com/sdv-dev/SDMetrics/actions?query=workflow%3A%22Run+Tests%22+branch%3Amaster)
        [![Coverage Status](https://codecov.io/gh/sdv-dev/SDMetrics/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/SDMetrics)
        
        <div align="left">
        <br/>
        <p align="center">
        <a href="https://github.com/sdv-dev/SDV">
        <img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDMetrics-DataCebo.png"></img>
        </a>
        </p>
        </div>
        
        </div>
        
        # Overview
        
        The **SDMetrics** library provides a set of **dataset-agnostic tools** for evaluating the **quality
        of a synthetic database** by comparing it to the real database that it is modeled after.
        
        | Important Links                               |                                                                      |
        | --------------------------------------------- | -------------------------------------------------------------------- |
        | :computer: **[Website]**                      | Check out the SDV Website for more information about the project.    |
        | :orange_book: **[SDV Blog]**                  | Regular publishing of useful content about Synthetic Data Generation. |
        | :book: **[Documentation]**                    | Quickstarts, User and Development Guides, and API Reference.         |
        | :octocat: **[Repository]**                    | The link to the GitHub Repository of this library.                   |
        | :scroll: **[License]**                        | The entire ecosystem is published under the MIT License.             |
        | :keyboard: **[Development Status]**           | This software is in its Pre-Alpha stage.                             |
        | [![][Slack Logo] **Community**][Community]    | Join our Slack Workspace for announcements and discussions.          |
        | [![][MyBinder Logo] **Tutorials**][Tutorials] | Run the SDV Tutorials in a Binder environment.                       |
        
        [Website]: https://sdv.dev
        [SDV Blog]: https://sdv.dev/blog
        [Documentation]: https://sdv.dev/SDV
        [Repository]: https://github.com/sdv-dev/SDMetrics
        [License]: https://github.com/sdv-dev/SDMetrics/blob/master/LICENSE
        [Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha
        [Slack Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/slack.png
        [Community]: https://bit.ly/sdv-slack-invite
        [MyBinder Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/mybinder.png
        [Tutorials]: https://mybinder.org/v2/gh/sdv-dev/SDV/master?filepath=tutorials
        
        ## Features
        
        It supports multiple data modalities:
        
        * **Single Columns**: Compare one-dimensional `numpy` arrays representing individual columns.
        * **Column Pairs**: Compare how columns in a `pandas.DataFrame` relate to each other, in groups of 2.
        * **Single Table**: Compare an entire table, represented as a `pandas.DataFrame`.
        * **Multi Table**: Compare multi-table and relational datasets represented as a python `dict` with
          multiple tables passed as `pandas.DataFrame`s.
        * **Time Series**: Compare tables representing ordered sequences of events.
        
        It includes a variety of metrics such as:
        
        * **Statistical metrics** which use statistical tests to compare the distributions of the real
          and synthetic data.
        * **Detection metrics** which use machine learning to try to distinguish between real and synthetic data.
        * **Efficacy metrics** which compare the performance of machine learning models when run on the synthetic and real data.
        * **Bayesian Network and Gaussian Mixture metrics** which learn the distribution of the real data
          and evaluate the likelihood of the synthetic data belonging to the learned distribution.
        * **Privacy metrics** which evaluate whether the synthetic data is leaking information about the real data.
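        
        As a rough illustration of the statistical family, the sketch below compares two single
        columns by computing the complement of the two-sample Kolmogorov-Smirnov D statistic. It is
        a minimal pure-Python sketch and not the library's implementation: the function name
        `ks_complement` and the toy columns are hypothetical, and plain lists stand in for `numpy`
        arrays.
        
        ```python
        from bisect import bisect_right
        
        
        def ks_complement(real, synthetic):
            """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
            real = sorted(real)
            synthetic = sorted(synthetic)
            if not real or not synthetic:
                raise ValueError("both samples must be non-empty")
        
            # D is the largest gap between the two empirical CDFs. The CDFs only
            # change at observed values, so checking those points is enough.
            d_statistic = 0.0
            for value in set(real) | set(synthetic):
                cdf_real = bisect_right(real, value) / len(real)
                cdf_synthetic = bisect_right(synthetic, value) / len(synthetic)
                d_statistic = max(d_statistic, abs(cdf_real - cdf_synthetic))
        
            return 1.0 - d_statistic
        
        
        # Hypothetical toy columns, used only for illustration.
        real_column = [1.0, 2.0, 3.0, 4.0]
        synthetic_column = [1.5, 2.5, 3.5, 6.0]
        score = ks_complement(real_column, synthetic_column)
        print(score)
        ```
        
        A score of `1` means the two empirical distributions coincide, while a score of `0` means
        they do not overlap at all, which matches a `MAXIMIZE` goal on a `0` to `1` scale.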
        
        # Install
        
        **SDMetrics** is part of the **SDV** project and is automatically installed alongside it. For
        details about this process, please visit the [SDV Installation Guide](
        https://sdv.dev/SDV/getting_started/install.html).
        
        Optionally, **SDMetrics** can also be installed as a standalone library using the following commands:
        
        **Using `pip`:**
        
        ```bash
        pip install sdmetrics
        ```
        
        **Using `conda`:**
        
        ```bash
        conda install -c conda-forge -c pytorch sdmetrics
        ```
        
        For more installation options, please visit the [SDMetrics Installation Guide](INSTALL.md).
        
        # Usage
        
        **SDMetrics** is included as part of the framework offered by SDV to evaluate the quality of
        your synthetic dataset. For more details about how to use it please visit the corresponding
        User Guide:
        
        * [Evaluating Synthetic Data](https://sdv.dev/SDV/user_guides/evaluation/index.html)
        
        ## Standalone usage
        
        **SDMetrics** can also be used as a standalone library to run metrics individually.
        
        In this short example we show how to use it to evaluate a toy multi-table dataset and its
        synthetic replica by running all the compatible multi-table metrics on it:
        
        ```python3
        import sdmetrics
        
        # Load the demo data, which includes:
        # - A dict containing the real tables as pandas.DataFrames.
        # - A dict containing the synthetic clones of the real data.
        # - A dict containing metadata about the tables.
        real_data, synthetic_data, metadata = sdmetrics.load_demo()
        
        # Obtain the list of multi table metrics, which is returned as a dict
        # containing the metric names and the corresponding metric classes.
        metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()
        
        # Run all the compatible metrics and get a report
        sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)
        ```
        
        The output will be a table with all the details about the executed metrics and their score:
        
        | metric                       | name                                         |      score |   min_value |   max_value | goal     |
        |------------------------------|----------------------------------------------|------------|-------------|-------------|----------|
        | CSTest                       | Chi-Squared                                  |   0.76651  |           0 |           1 | MAXIMIZE |
        | KSComplement                 | Complement to Kolmogorov-Smirnov D statistic |   0.75     |           0 |           1 | MAXIMIZE |
        | KSTestExtended               | Inverted Kolmogorov-Smirnov D statistic      |   0.777778 |           0 |           1 | MAXIMIZE |
        | LogisticDetection            | LogisticRegression Detection                 |   0.882716 |           0 |           1 | MAXIMIZE |
        | SVCDetection                 | SVC Detection                                |   0.833333 |           0 |           1 | MAXIMIZE |
        | BNLikelihood                 | BayesianNetwork Likelihood                   | nan        |           0 |           1 | MAXIMIZE |
        | BNLogLikelihood              | BayesianNetwork Log Likelihood               | nan        |        -inf |           0 | MAXIMIZE |
        | LogisticParentChildDetection | LogisticRegression Detection                 |   0.619444 |           0 |           1 | MAXIMIZE |
        | SVCParentChildDetection      | SVC Detection                                |   0.916667 |           0 |           1 | MAXIMIZE |
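        
        One way to read such a report is to separate the metrics that produced a score from those
        reported as `nan`, and then look for the weakest result. The sketch below does this with
        plain Python on a few rows copied by hand from the table above. The list-of-dicts layout
        and the variable names are assumptions made for illustration; the object returned by
        `compute_metrics` may need to be converted before it can be handled this way.
        
        ```python
        import math
        
        # Rows copied by hand from the example report above; only a subset of
        # the columns is kept. The dict layout is an assumption of this sketch.
        report = [
            {"metric": "CSTest", "score": 0.76651, "goal": "MAXIMIZE"},
            {"metric": "KSComplement", "score": 0.75, "goal": "MAXIMIZE"},
            {"metric": "BNLikelihood", "score": float("nan"), "goal": "MAXIMIZE"},
            {"metric": "SVCParentChildDetection", "score": 0.916667, "goal": "MAXIMIZE"},
        ]
        
        # NaN never compares equal to itself, so math.isnan is used to detect it.
        computed = [row for row in report if not math.isnan(row["score"])]
        skipped = [row["metric"] for row in report if math.isnan(row["score"])]
        
        # Every goal here is MAXIMIZE, so the lowest score marks the weakest area.
        weakest = min(computed, key=lambda row: row["score"])
        
        print("skipped:", skipped)
        print("weakest:", weakest["metric"], weakest["score"])
        ```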
        
        # What's next?
        
        If you want to read more about each individual metric, please visit the following folders:
        
        * Single Column Metrics: [sdmetrics/single_column](sdmetrics/single_column)
        * Single Table Metrics: [sdmetrics/single_table](sdmetrics/single_table)
        * Multi Table Metrics: [sdmetrics/multi_table](sdmetrics/multi_table)
        * Time Series Metrics: [sdmetrics/timeseries](sdmetrics/timeseries)
        
        ---
        
        
        <div align="center">
        <a href="https://datacebo.com"><img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/DataCebo.png"></img></a>
        </div>
        <br/>
        <br/>
        
        [The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](
        https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
        created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
        Today, DataCebo is the proud developer of SDV, the largest ecosystem for
        synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
        data, including:
        
        * 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
        * 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
          multi table and time series data.
        * 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
          generation models.
        
        [Get started using the SDV package](https://sdv.dev/SDV/getting_started/install.html) -- a fully
        integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries
        for specific needs.
        
        
        # History
        
        ## v0.5.0 - 2022-05-11
        
        This release fixes an error where the relational `KSTest` crashes if a table doesn't have numerical columns.
        It also includes some housekeeping, updating the pomegranate and copulas version requirements.
        
        ### Issues closed
        
        * Cap pomegranate to <0.14.7 - Issue [#116](https://github.com/sdv-dev/SDMetrics/issues/116) by @csala
        * Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue [#109](https://github.com/sdv-dev/SDMetrics/pull/109) by @katxiao
        
        ## v0.4.1 - 2021-12-09
        
        This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.
        
        ### Issues closed
        
        * Report metric errors from compute_metrics - Issue [#107](https://github.com/sdv-dev/SDMetrics/issues/107) by @katxiao
        * Specify default categorical transformers - Issue [#105](https://github.com/sdv-dev/SDMetrics/pull/105) by @katxiao
        
        ## v0.4.0 - 2021-11-16
        
        This release adds support for Python 3.9, updates dependencies to ensure compatibility with the
        rest of the SDV ecosystem, and upgrades to the latest [RDT](https://github.com/sdv-dev/RDT/releases/tag/v0.6.1)
        release.
        
        ### Issues closed
        
        * Replace `sktime` for `pyts` - Issue [#103](https://github.com/sdv-dev/SDMetrics/issues/103) by @pvk-developer
        * Add support for Python 3.9 - Issue [#102](https://github.com/sdv-dev/SDMetrics/issues/102) by @pvk-developer
        * Increase code style lint - Issue [#80](https://github.com/sdv-dev/SDMetrics/issues/80) by @fealho
        * Add `pip check` to `CI` workflows - Issue [#79](https://github.com/sdv-dev/SDMetrics/issues/79) by @pvk-developer
        * Upgrade dependency ranges - Issue [#69](https://github.com/sdv-dev/SDMetrics/issues/69) by @katxiao
        
        ## v0.3.2 - 2021-08-16
        
        This release makes `pomegranate` an optional dependency.
        
        ### Issues closed
        
        * Make pomegranate an optional dependency - Issue [#63](https://github.com/sdv-dev/SDMetrics/issues/63) by @fealho
        
        ## v0.3.1 - 2021-07-12
        
        This release fixes a bug to make the privacy metrics available in the API docs.
        It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.
        
        ### Issues closed
        
        * `CategoricalSVM` not being imported - Issue [#65](https://github.com/sdv-dev/SDMetrics/issues/65) by @csala
        
        ## v0.3.0 - 2021-03-30
        
        This release includes privacy metrics to evaluate if the real data could be obtained or
        deduced from the synthetic samples. Additionally, all the metrics have a `normalize` method
        which takes the `raw_score` generated by the metric and returns a value between `0` and `1`.
        
        ### Issues closed
        
        * Add normalize method to metrics - Issue [#51](https://github.com/sdv-dev/SDMetrics/issues/51) by @csala and @fealho
        * Implement privacy metrics - Issue [#36](https://github.com/sdv-dev/SDMetrics/issues/36) by @ZhuofanXie and @fealho
        
        ## v0.2.0 - 2021-02-24
        
        Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.
        
        ## v0.1.3 - 2021-02-13
        
        Updates the required dependencies to facilitate a conda release.
        
        ### Issues closed
        
        * Upgrade sktime - Issue [#49](https://github.com/sdv-dev/SDMetrics/issues/49) by @fealho
        
        ## v0.1.2 - 2021-01-27
        
        Bug fixing release that addresses several minor errors.
        
        ### Issues closed
        
        * More splits than classes - Issue [#46](https://github.com/sdv-dev/SDMetrics/issues/46) by @fealho
        * Scipy 1.6.0 causes an AttributeError - Issue [#44](https://github.com/sdv-dev/SDMetrics/issues/44) by @fealho
        * Time series metrics fails with variable length timeseries - Issue [#42](https://github.com/sdv-dev/SDMetrics/issues/42) by @fealho
        * ParentChildDetection metrics KeyError - Issue [#39](https://github.com/sdv-dev/SDMetrics/issues/39) by @csala
        
        ## v0.1.1 - 2020-12-30
        
        This version adds Time Series Detection and Efficacy metrics, as well as a fix
        to ensure that Single Table binary classification efficacy metrics work well
        with binary targets which are not boolean.
        
        ### Issues closed
        
        * Timeseries efficacy metrics - Issue [#35](https://github.com/sdv-dev/SDMetrics/issues/35) by @csala
        * Timeseries detection metrics - Issue [#34](https://github.com/sdv-dev/SDMetrics/issues/34) by @csala
        * Ensure binary classification targets are bool - Issue [#33](https://github.com/sdv-dev/SDMetrics/issues/33) by @csala
        
        ## v0.1.0 - 2020-12-18
        
        This release introduces a new project organization and API, with metrics
        grouped by data modality, with a common API:
        
        * Single Column
        * Column Pair
        * Single Table
        * Multi Table
        * Time Series
        
        Within each data modality, different families of metrics have been implemented:
        
        * Statistical
        * Detection
        * Bayesian Network and Gaussian Mixture Likelihood
        * Machine Learning Efficacy
        
        ## v0.0.4 - 2020-11-27
        
        Patch release to relax dependencies and avoid conflicts when using the latest SDV version.
        
        ## v0.0.3 - 2020-11-20
        
        Fix error on detection metrics when input data contains infinity or NaN values.
        
        ### Issues closed
        
        * ValueError: Input contains infinity or a value too large for dtype('float64') - Issue [#11](https://github.com/sdv-dev/SDMetrics/issues/11) by @csala
        
        ## v0.0.2 - 2020-08-08
        
        Add support for Python 3.8 and a broader range of dependencies.
        
        ## v0.0.1 - 2020-06-26
        
        First release to PyPI.
        
Keywords: sdmetrics sdmetrics SDMetrics
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.6,<3.10
Description-Content-Type: text/markdown
Provides-Extra: test
Provides-Extra: pomegranate
Provides-Extra: dev
