Metadata-Version: 2.1
Name: sdmetrics
Version: 0.8.1
Summary: Metrics for Synthetic Data Generation Projects
Home-page: https://github.com/sdv-dev/SDMetrics
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Keywords: sdmetrics sdmetrics SDMetrics
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.6,<3.10
Description-Content-Type: text/markdown
Provides-Extra: test
Provides-Extra: torch
Provides-Extra: pomegranate
Provides-Extra: dev
License-File: LICENSE
License-File: AUTHORS.rst

<div align="center">
<br/>
<p align="center">
    <i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
[![PyPI Shield](https://img.shields.io/pypi/v/sdmetrics.svg)](https://pypi.python.org/pypi/sdmetrics)
[![Downloads](https://pepy.tech/badge/sdmetrics)](https://pepy.tech/project/sdmetrics)
[![Tests](https://github.com/sdv-dev/SDMetrics/workflows/Run%20Tests/badge.svg)](https://github.com/sdv-dev/SDMetrics/actions?query=workflow%3A%22Run+Tests%22+branch%3Amaster)
[![Coverage Status](https://codecov.io/gh/sdv-dev/SDMetrics/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/SDMetrics)
[![Slack](https://img.shields.io/badge/Community-Slack-blue?style=plastic&logo=slack)](https://bit.ly/sdv-slack-invite)
[![Tutorial](https://img.shields.io/badge/Demo-Get%20started-orange?style=plastic&logo=googlecolab)](https://bit.ly/sdmetrics-demo)

<div align="left">
<br/>
<p align="center">
<a href="https://github.com/sdv-dev/SDV">
<img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDMetrics-DataCebo.png"></img>
</a>
</p>
</div>

</div>

# Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example **quality and privacy**. It also includes reports that you can run to generate insights, visualize data and share with your team.

The SDMetrics library is **model-agnostic**, meaning you can use any synthetic data. The library does not need to know how you created the data. 

<img align="center" src="docs/images/column_comparison.png"></img>

# Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

```bash
pip install sdmetrics
```

```bash
conda install -c conda-forge sdmetrics
```

For more information about using SDMetrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics).

# Usage

Get started with **SDMetrics Reports** using some demo data,

```python
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
```
```
Creating report: 100%|██████████| 4/4 [00:00<00:00,  5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
```

Once you generate the report, you can drill down on the details and visualize the results.

```python
my_report.get_visualization(property_name='Column Pair Trends')
```
<img align="center" src="docs/images/column_pairs.png"></img>

Save the report and share it with your team.
```python
my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')
```

**Want more metrics?** You can also manually apply any of the metrics in this library to your data.

```python
# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_column import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
```
```
0.8503937007874016
```

```python
# calculate whether the synthetic data is new or whether it's an exact copy of the real data
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata
)
```
```
1.0
```

# What's next?

To learn more about the reports and metrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics). 

---


<div align="center">
<a href="https://datacebo.com"><img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/DataCebo.png"></img></a>
</div>
<br/>
<br/>

[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](
https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
Today, DataCebo is the proud developer of SDV, the largest ecosystem for
synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
data, including:

* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
  multi table and time series data.
* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
  generation models.

[Get started using the SDV package](https://sdv.dev/SDV/getting_started/install.html) -- a fully
integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries
for specific needs.


# History

## v0.8.0 - 2022-11-02

This release introduces the `DiagnosticReport`, which helps a user verify – at a quick glance – that their data is valid. We also fix an existing bug with detection metrics.

### New Features
* Fixes for new metadata - Issue [#253](https://github.com/sdv-dev/SDMetrics/issues/253) by @katxiao
* Add default synthetic sample size to DiagnosticReport - Issue [#248](https://github.com/sdv-dev/SDMetrics/issues/248) by @katxiao
* Exclude pii columns from single table metrics - Issue [#245](https://github.com/sdv-dev/SDMetrics/issues/245) by @katxiao
* Accept both old and new metadata - Issue [#244](https://github.com/sdv-dev/SDMetrics/issues/244) by @katxiao
* Address Diagnostic Report and metric edge cases - Issue [#243](https://github.com/sdv-dev/SDMetrics/issues/243) by @katxiao
* Update visualization average per table - Issue [#242](https://github.com/sdv-dev/SDMetrics/issues/242) by @katxiao
* Add save and load functionality to multi-table DiagnosticReport - Issue [#218](https://github.com/sdv-dev/SDMetrics/issues/218) by @katxiao
* Visualization methods for the multi-table DiagnosticReport - Issue [#217](https://github.com/sdv-dev/SDMetrics/issues/217) by @katxiao
* Add getter methods to multi-table DiagnosticReport - Issue [#216](https://github.com/sdv-dev/SDMetrics/issues/216) by @katxiao
* Create multi-table DiagnosticReport - Issue [#215](https://github.com/sdv-dev/SDMetrics/issues/215) by @katxiao
* Visualization methods for the single-table DiagnosticReport - Issue [#211](https://github.com/sdv-dev/SDMetrics/issues/211) by @katxiao
* Add getter methods to single-table DiagnosticReport - Issue [#210](https://github.com/sdv-dev/SDMetrics/issues/210) by @katxiao
* Create single-table DiagnosticReport - Issue [#209](https://github.com/sdv-dev/SDMetrics/issues/209) by @katxiao
* Add save and load functionality to single-table DiagnosticReport - Issue [#212](https://github.com/sdv-dev/SDMetrics/issues/212) by @katxiao
* Add single table diagnostic report - Issue [#237](https://github.com/sdv-dev/SDMetrics/issues/237) by @katxiao

### Bug Fixes
* Detection test test doesn't look at metadata when determining which columns to use - Issue [#119](https://github.com/sdv-dev/SDMetrics/issues/119) by @R-Palazzo

### Internal Improvements
* Remove torch dependency - Issue [#233](https://github.com/sdv-dev/SDMetrics/issues/233) by @katxiao
* Update README - Issue [#250](https://github.com/sdv-dev/SDMetrics/issues/250) by @katxiao


## v0.7.0 - 2022-09-27

This release introduces the `QualityReport`, which evaluates how well synthetic data captures mathematical properties from the real data. The `QualityReport` incorporates the new metrics introduced in the previous release, and allows users to get detailed results, visualize the scores, and save the report for future viewing. We also add utility methods for visualizing columns and pairs of columns.

### New Features
* Catch typeerror in new row synthesis query - Issue [#234](https://github.com/sdv-dev/SDMetrics/issues/234) by @katxiao
* Add NewRowSynthesis Metric - Issue [#207](https://github.com/sdv-dev/SDMetrics/issues/207) by @katxiao
* Update plot utilities API - Issue [#228](https://github.com/sdv-dev/SDMetrics/issues/228) by @katxiao
* Fix column pairs visualization bug - Issue [#230](https://github.com/sdv-dev/SDMetrics/issues/230) by @katxiao
* Save version - Issue [#229](https://github.com/sdv-dev/SDMetrics/issues/229) by @katxiao
* Update efficacy metrics API - Issue [#227](https://github.com/sdv-dev/SDMetrics/issues/227) by @katxiao
* Add RangeCoverage Metric - Issue [#208](https://github.com/sdv-dev/SDMetrics/issues/208) by @katxiao
* Add get_column_pairs_plot utility method - Issue [#223](https://github.com/sdv-dev/SDMetrics/issues/223) by @katxiao
* Parse date as datetime - Issue [#222](https://github.com/sdv-dev/SDMetrics/issues/222) by @katxiao
* Update error handling for reports - Issue [#221](https://github.com/sdv-dev/SDMetrics/issues/221) by @katxiao
* Visualization API update - Issue [#220](https://github.com/sdv-dev/SDMetrics/issues/220) by @katxiao
* Bug fixes for QualityReport - Issue [#219](https://github.com/sdv-dev/SDMetrics/issues/219) by @katxiao
* Update column pair metric calculation - Issue [#214](https://github.com/sdv-dev/SDMetrics/issues/214) by @katxiao
* Add get score methods for multi table QualityReport - Issue [#190](https://github.com/sdv-dev/SDMetrics/issues/190) by @katxiao
* Add multi table QualityReport visualization methods - Issue [#192](https://github.com/sdv-dev/SDMetrics/issues/192) by @katxiao
* Add plot_column visualization utility method - Issue [#193](https://github.com/sdv-dev/SDMetrics/issues/193) by @katxiao
* Add save and load behavior to multi table QualityReport - Issue [#188](https://github.com/sdv-dev/SDMetrics/issues/188) by @katxiao
* Create multi-table QualityReport - Issue [#186](https://github.com/sdv-dev/SDMetrics/issues/186) by @katxiao
* Add single table QualityReport visualization methods - Issue [#191](https://github.com/sdv-dev/SDMetrics/issues/191) by @katxiao
* Add save and load behavior to single table QualityReport - Issue [#187](https://github.com/sdv-dev/SDMetrics/issues/187) by @katxiao
* Add get score methods for single table Quality Report - Issue [#189](https://github.com/sdv-dev/SDMetrics/issues/189) by @katxiao
* Create single-table QualityReport - Issue [#185](https://github.com/sdv-dev/SDMetrics/issues/185) by @katxiao

### Internal Improvements
* Auto apply "new" label instead of "pending review" - Issue [#164](https://github.com/sdv-dev/SDMetrics/issues/164) by @katxiao
* fix typo - Issue [#195](https://github.com/sdv-dev/SDMetrics/issues/195) by @fealho


## v0.6.0 - 2022-08-12

This release removes SDMetric's dependency on the RDT library, and also introduces new quality and diagnostic metrics. Additionally, we introduce a new `compute_breakdown` method that returns a breakdown of metric results.

### New Features
* Handle null values correctly - Issue [#194](https://github.com/sdv-dev/SDMetrics/issues/194) by @katxiao
* Add wrapper classes for new single and multi table metrics - Issue [#169](https://github.com/sdv-dev/SDMetrics/issues/169) by @katxiao
* Add CorrelationSimilarity metric - Issue [#143](https://github.com/sdv-dev/SDMetrics/issues/143) by @katxiao
* Add CardinalityShapeSimilarity metric - Issue [#160](https://github.com/sdv-dev/SDMetrics/issues/160) by @katxiao
* Add CardinalityStatisticSimilarity metric - Issue [#145](https://github.com/sdv-dev/SDMetrics/issues/145) by @katxiao
* Add ContingencySimilarity Metric - Issue [#159](https://github.com/sdv-dev/SDMetrics/issues/159) by @katxiao
* Add TVComplement metric - Issue [#142](https://github.com/sdv-dev/SDMetrics/issues/142) by @katxiao
* Add MissingValueSimilarity metric - Issue [#139](https://github.com/sdv-dev/SDMetrics/issues/139) by @katxiao
* Add CategoryCoverage metric - Issue [#140](https://github.com/sdv-dev/SDMetrics/issues/140) by @katxiao
* Add compute breakdown column for single column - Issue [#152](https://github.com/sdv-dev/SDMetrics/issues/152) by @katxiao
* Add BoundaryAdherence metric - Issue [#138](https://github.com/sdv-dev/SDMetrics/issues/138) by @katxiao
* Get KSComplement Score Breakdown - Issue [#130](https://github.com/sdv-dev/SDMetrics/issues/130) by @katxiao
* Add StatisticSimilarity Metric - Issue [#137](https://github.com/sdv-dev/SDMetrics/issues/130) by @katxiao
* New features for KSTest.compute - Issue [#129](https://github.com/sdv-dev/SDMetrics/issues/129) by @amontanez24

### Internal Improvements
* Add integration tests and fixes - Issue [#183](https://github.com/sdv-dev/SDMetrics/issues/183) by @katxiao
* Remove rdt hypertransformer dependency in timeseries metrics - Issue [#176](https://github.com/sdv-dev/SDMetrics/issues/178) by @katxiao
* Replace rdt LabelEncoder with sklearn - Issue [#178](https://github.com/sdv-dev/SDMetrics/issues/178) by @katxiao
* Remove rdt as a dependency - Issue [#182](https://github.com/sdv-dev/SDMetrics/issues/182) by @katxiao
* Use sklearn's OneHotEncoder instead of rdt - Issue [#170](https://github.com/sdv-dev/SDMetrics/issues/170) by @katxiao
* Remove KSTestExtended - Issue [#180](https://github.com/sdv-dev/SDMetrics/issues/180) by @katxiao
* Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue [#171](https://github.com/sdv-dev/SDMetrics/issues/171) by @katxiao
* Update the default tags for a feature request - Issue [#172](https://github.com/sdv-dev/SDMetrics/issues/172) by @katxiao
* Bump github macos version - Issue [#174](https://github.com/sdv-dev/SDMetrics/issues/174) by @katxiao
* Fix pydocstyle to check sdmetrics - Issue [#153](https://github.com/sdv-dev/SDMetrics/issues/153) by @pvk-developer
* Update the RDT version to 1.0 - Issue [#150](https://github.com/sdv-dev/SDMetrics/issues/150) by @pvk-developer
* Update slack invite link - Issue [#132](https://github.com/sdv-dev/SDMetrics/issues/132) by @pvk-developer

## v0.5.0 - 2022-05-11

This release fixes an error  where the relational `KSTest` crashes if a table doesn't have numerical columns.
It also includes some housekeeping, updating the pomegranate and copulas version requirements.

### Issues closed

* Cap pomegranate to <0.14.7 - Issue [#116](https://github.com/sdv-dev/SDMetrics/issues/116) by @csala
* Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue [#109](https://github.com/sdv-dev/SDMetrics/pull/109) by @katxiao

## v0.4.1 - 2021-12-09

This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.

### Issues closed

* Report metric errors from compute_metrics - Issue [#107](https://github.com/sdv-dev/SDMetrics/issues/107) by @katxiao
* Specify default categorical transformers - Issue [#105](https://github.com/sdv-dev/SDMetrics/pull/105) by @katxiao

## v0.4.0 - 2021-11-16

This release adds support for Python 3.9 and updates dependencies to ensure compatibility with the
rest of the SDV ecosystem, and upgrades to the latests [RDT](https://github.com/sdv-dev/RDT/releases/tag/v0.6.1)
release.

### Issues closed

* Replace `sktime` for `pyts` - Issue [#103](https://github.com/sdv-dev/SDMetrics/issues/103) by @pvk-developer
* Add support for Python 3.9 - Issue [#102](https://github.com/sdv-dev/SDMetrics/issues/102) by @pvk-developer
* Increase code style lint - Issue [#80](https://github.com/sdv-dev/SDMetrics/issues/80) by @fealho
* Add `pip check` to `CI` workflows - Issue [#79](https://github.com/sdv-dev/SDMetrics/issues/79) by @pvk-developer
* Upgrade dependency ranges - Issue [#69](https://github.com/sdv-dev/SDMetrics/issues/69) by @katxiao

## v0.3.2 - 2021-08-16

This release makes `pomegranate` an optional dependency.

### Issues closed

* Make pomegranate an optional dependency - Issue [#63](https://github.com/sdv-dev/SDMetrics/issues/63) by @fealho

## v0.3.1 - 2021-07-12

This release fixes a bug to make the privacy metrics available in the API docs.
It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.

### Issues closed

* `CategoricalSVM` not being imported - Issue [#65](https://github.com/sdv-dev/SDMetrics/issues/65) by @csala

## v0.3.0 - 2021-03-30

This release includes privacy metrics to evaluate if the real data could be obtained or
deduced from the synthetic samples. Additionally all the metrics have a `normalize` method
which takes the `raw_score` generated by the metric and returns a value between `0 ` and `1`.

### Issues closed

* Add normalize method to metrics - Issue [#51](https://github.com/sdv-dev/SDMetrics/issues/51) by @csala and @fealho
* Implement privacy metrics - Issue [#36](https://github.com/sdv-dev/SDMetrics/issues/36) by @ZhuofanXie and @fealho

## v0.2.0 - 2021-02-24

Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.

## v0.1.3 - 2021-02-13

Updates the required dependecies to facilitate a conda release.

### Issues closed

* Upgrade sktime - Issue [#49](https://github.com/sdv-dev/SDMetrics/issues/49) by @fealho

## v0.1.2 - 2021-01-27

Big fixing release that addresses several minor errors.

### Issues closed

* More splits than classes - Issue [#46](https://github.com/sdv-dev/SDMetrics/issues/46) by @fealho
* Scipy 1.6.0 causes an AttributeError - Issue [#44](https://github.com/sdv-dev/SDMetrics/issues/44) by @fealho
* Time series metrics fails with variable length timeseries - Issue [#42](https://github.com/sdv-dev/SDMetrics/issues/42) by @fealho
* ParentChildDetection metrics KeyError - Issue [#39](https://github.com/sdv-dev/SDMetrics/issues/39) by @csala

## v0.1.1 - 2020-12-30

This version adds Time Series Detection and Efficacy metrics, as well as a fix
to ensure that Single Table binary classification efficacy metrics work well
with binary targets which are not boolean.

### Issues closed

* Timeseries efficacy metrics - Issue [#35](https://github.com/sdv-dev/SDMetrics/issues/35) by @csala
* Timeseries detection metrics - Issue [#34](https://github.com/sdv-dev/SDMetrics/issues/34) by @csala
* Ensure binary classification targets are bool - Issue [#33](https://github.com/sdv-dev/SDMetrics/issues/33) by @csala

## v0.1.0 - 2020-12-18

This release introduces a new project organization and API, with metrics
grouped by data modality, with a common API:

* Single Column
* Column Pair
* Single Table
* Multi Table
* Time Series

Within each data modality, different families of metrics have been implemented:

* Statistical
* Detection
* Bayesian Network and Gaussian Mixture Likelihood
* Machine Learning Efficacy

## v0.0.4 - 2020-11-27

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.

## v0.0.3 - 2020-11-20

Fix error on detection metrics when input data contains infinity or NaN values.

### Issues closed

* ValueError: Input contains infinity or a value too large for dtype('float64') - Issue [#11](https://github.com/sdv-dev/SDMetrics/issues/11) by @csala

## v0.0.2 - 2020-08-08

Add support for Python 3.8 and a broader range of dependencies.

## v0.0.1 - 2020-06-26

First release to PyPI.
