Metadata-Version: 2.1
Name: irisml
Version: 0.0.14
Summary: Simple ML pipeline platform
Home-page: https://github.com/microsoft/irisml
Author: irisdev
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

# IrisML

A proof of concept for a simple framework to create an ML pipeline.


# Features
- Run ML training/inference with a simple JSON configuration.
- Modularized interfaces for task components.
- Cache task outputs for faster experiments.

# Getting started
## Installation
Prerequisite: Python 3.8+

```
# Install the core framework and standard tasks.
pip install irisml irisml-tasks irisml-tasks-training
```

## Run an example job
```
# Install additional packages that are required for the example
pip install irisml-tasks-torchvision

# Run on local machine
irisml_run docs/examples/mobilenetv2_mnist_training.json
```

## Available commands
```
# Run the specified pipeline. You can provide environment variables with the "-e" option; they are accessible through the $env variable in the JSON config.
irisml_run <pipeline_json_path> [-e <ENV_NAME>=<env_value>] [--no_cache] [-v]

# Show information about the specified task. If <task_name> is not provided, shows a list of available tasks in the current environment.
irisml_show [<task_name>]

# Manage a cache storage on Azure Blob Storage
# list - Show a list of matched blobs.
# download - Download matched blobs.
# remove - Remove matched blobs.
# show - Show the contents of matched blobs.
irisml_cache <list|download|remove|show> [--mtime <+|->N] [--name NAME]
```
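
When pipelines are launched from a script, the `irisml_run` command line above can be assembled programmatically. The following is a minimal sketch: `build_irisml_run_command` is a hypothetical helper, not part of IrisML, and `EPOCHS` is a made-up variable name. It only uses the options listed above. How several variables should be passed is not specified here, so repeating `-e` per variable is an assumption.

```python
from typing import Dict, List, Optional


def build_irisml_run_command(pipeline_json_path: str,
                             env: Optional[Dict[str, str]] = None,
                             no_cache: bool = False,
                             verbose: bool = False) -> List[str]:
    """Build an argv list for irisml_run (hypothetical helper, not part of IrisML)."""
    command = ['irisml_run', pipeline_json_path]
    for name, value in (env or {}).items():
        # Assumption: one "-e NAME=value" pair per variable.
        command += ['-e', f'{name}={value}']
    if no_cache:
        command.append('--no_cache')
    if verbose:
        command.append('-v')
    return command


command = build_irisml_run_command('docs/examples/mobilenetv2_mnist_training.json',
                                   env={'EPOCHS': '3'}, no_cache=True)
print(command)
```

The resulting list can be handed to `subprocess.run(command, check=True)` in an environment where IrisML is installed.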

## Pipeline definition
```
PipelineDefinition = {"tasks": List[TaskDefinition], "on_error": Optional[List[TaskDefinition]]}

TaskDefinition = {
    "task": <task module name>,
    "name": <optional unique name of the task>,
    "inputs": <list of input objects>,
    "config": <config for the task. Use irisml_show command to find the available configurations.>
}
```
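
As an illustration of the shape above, a pipeline definition can be built as a Python dictionary, checked, and serialized to JSON. This is a sketch under assumptions: `check_pipeline_definition` is a hypothetical helper, not part of IrisML. Treating only `"task"` as required and checking name uniqueness across both lists are assumptions. The two task names come from the irisml-tasks table, but their inputs and config are omitted here; use `irisml_show` to find what they actually accept.

```python
import json
from typing import Any, Dict


def check_pipeline_definition(definition: Dict[str, Any]) -> None:
    """Raise ValueError if the definition does not match the shape described above."""
    tasks = definition.get('tasks')
    if not isinstance(tasks, list):
        raise ValueError('"tasks" must be a list of task definitions.')
    on_error = definition.get('on_error')
    if on_error is not None and not isinstance(on_error, list):
        raise ValueError('"on_error" must be a list of task definitions if present.')
    names = set()
    for task in tasks + (on_error or []):
        if not isinstance(task, dict) or not isinstance(task.get('task'), str):
            raise ValueError('Each task definition needs a "task" module name.')
        name = task.get('name')
        if name is not None:
            if name in names:
                raise ValueError(f'Duplicate task name: {name}')
            names.add(name)


pipeline = {
    'tasks': [
        {'task': 'get_current_time', 'name': 'start_time'},
        {'task': 'print_environment_info'},
    ],
    'on_error': [{'task': 'print_environment_info'}],
}
check_pipeline_definition(pipeline)
pipeline_json = json.dumps(pipeline, indent=2)  # Text that could be saved as a pipeline JSON file.
print(pipeline_json)
```
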
In TaskDefinition.inputs and TaskDefinition.config, you can use the following two variables.
- $env.<variable_name>
  This variable is replaced by the environment variable that was provided as an argument to the irisml_run command.
- $outputs.<task_name>.<field_name>
  This variable is replaced by the output of the specified previous task.

An exception is raised at runtime if the specified variable is not found.
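
The substitution rules above can be sketched as a small recursive resolver. This is illustrative only and is not IrisML's actual implementation: `resolve_variables` and all names in the example data are hypothetical, and the sketch assumes a variable always occupies a whole string value.

```python
import re
from typing import Any, Dict

# Match a whole string such as "$env.NAME" or "$outputs.task_name.field_name".
_ENV_PATTERN = re.compile(r'\$env\.(\w+)')
_OUTPUTS_PATTERN = re.compile(r'\$outputs\.(\w+)\.(\w+)')


def resolve_variables(value: Any, env: Dict[str, Any], outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Recursively replace variable references in a JSON-like value."""
    if isinstance(value, dict):
        return {k: resolve_variables(v, env, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_variables(v, env, outputs) for v in value]
    if isinstance(value, str):
        env_match = _ENV_PATTERN.fullmatch(value)
        if env_match:
            name = env_match.group(1)
            if name not in env:
                raise KeyError(f'Environment variable not found: {name}')
            return env[name]
        outputs_match = _OUTPUTS_PATTERN.fullmatch(value)
        if outputs_match:
            task_name, field_name = outputs_match.groups()
            if task_name not in outputs or field_name not in outputs[task_name]:
                raise KeyError(f'Output not found: {task_name}.{field_name}')
            return outputs[task_name][field_name]
    return value  # Anything else is passed through unchanged.


# Hypothetical names and values, for illustration only.
fragment = {'num_epochs': '$env.EPOCHS', 'model': '$outputs.create_model.model', 'tags': ['$env.RUN_NAME', 'fixed']}
env = {'EPOCHS': '3', 'RUN_NAME': 'trial'}
outputs = {'create_model': {'model': '<model object>'}}
resolved = resolve_variables(fragment, env, outputs)
print(resolved)
```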

If a task raises an exception, the tasks specified in the on_error field are executed. The exception object is assigned to the "$env.IRISML_EXCEPTION" variable.
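
The error-handling flow can be pictured with the following sketch. It is not IrisML's runner: `run_with_error_handlers` and the example tasks are hypothetical, tasks are modeled as plain callables, and returning `False` after the handlers run (instead of re-raising) is an assumption, since the behavior after on_error is not stated here.

```python
from typing import Any, Callable, Dict, List

TaskCallable = Callable[[Dict[str, Any]], None]


def run_with_error_handlers(tasks: List[TaskCallable],
                            on_error: List[TaskCallable],
                            env: Dict[str, Any]) -> bool:
    """Run tasks in order. On failure, expose the exception and run the handlers.

    Returns True if every task succeeded, False otherwise.
    """
    try:
        for task in tasks:
            task(env)
    except Exception as e:
        # In this sketch, handlers read the exception from env['IRISML_EXCEPTION'].
        env['IRISML_EXCEPTION'] = e
        for handler in on_error:
            handler(env)
        return False
    return True


def failing_task(env: Dict[str, Any]) -> None:
    raise RuntimeError('training failed')


reported: List[str] = []


def report_error(env: Dict[str, Any]) -> None:
    reported.append(str(env['IRISML_EXCEPTION']))


succeeded = run_with_error_handlers([failing_task], [report_error], env={})
print(succeeded, reported)
```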

## Pipeline cache
With the cache, you can modify and re-run a pipeline config at minimal cost. If the cache is enabled, IrisML calculates hash values for all task inputs/configs and uploads the task outputs to the specified storage. When it finds a task with the same hash values, it downloads the cached outputs and skips the task execution.
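
The idea can be sketched in a few lines. This is a simplified illustration, not IrisML's hashing scheme: `compute_cache_key` and `run_with_cache` are hypothetical helpers, a dictionary stands in for the storage, the inputs and config are assumed to be JSON-serializable, and the accuracy value is dummy data.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def compute_cache_key(task_name: str, inputs: Dict[str, Any], config: Dict[str, Any]) -> str:
    """Hash the task name, inputs, and config into a stable key."""
    payload = json.dumps({'task': task_name, 'inputs': inputs, 'config': config}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


def run_with_cache(storage: Dict[str, Any], key: str, execute: Callable[[], Any]) -> Any:
    """Return the cached outputs if the key is known; otherwise execute and store."""
    if key in storage:
        return storage[key]
    outputs = execute()
    storage[key] = outputs
    return outputs


storage: Dict[str, Any] = {}
calls = []


def expensive_task() -> Dict[str, float]:
    calls.append(1)  # Count how many times the task really runs.
    return {'accuracy': 0.9}  # Dummy value for illustration.


key = compute_cache_key('train', {'dataset': 'mnist'}, {'num_epochs': 3})
first = run_with_cache(storage, key, expensive_task)
second = run_with_cache(storage, key, expensive_task)  # Served from the cache.
changed_key = compute_cache_key('train', {'dataset': 'mnist'}, {'num_epochs': 5})
print(len(calls), key != changed_key)
```

Changing any input or config value produces a different key, so only the modified task and the tasks that depend on its outputs need to run again.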

To enable the cache, you must specify the cache storage location by setting the IRISML_CACHE_URL environment variable. Currently, Azure Blob Storage and the local filesystem are supported.

To use Azure Blob Storage, a container URL must be provided. If the URL contains a SAS token, it will be used for authentication. Otherwise, interactive authentication and Managed Identity authentication will be used.
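
To check which authentication path a given container URL would take, one can test for a SAS token in its query string. The sketch below assumes that the `sig` (signature) query parameter marks a SAS token. `has_sas_token` is a hypothetical helper, the URLs are placeholders, and IrisML's own detection logic may differ.

```python
from urllib.parse import parse_qs, urlparse


def has_sas_token(container_url: str) -> bool:
    """Return True if the URL query string carries a SAS signature parameter."""
    query = parse_qs(urlparse(container_url).query)
    return 'sig' in query


# Placeholder URLs, for illustration only.
url_with_sas = 'https://example.blob.core.windows.net/cache?sv=2021-08-06&sr=c&sig=abc123'
url_without_sas = 'https://example.blob.core.windows.net/cache'
print(has_sas_token(url_with_sas), has_sas_token(url_without_sas))
```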

# Available official tasks

To show the detailed help for each task, run the following command after installing the package.
```
irisml_show <task_name>
```

## [irisml-tasks](https://github.com/microsoft/irisml-tasks)
| Task | Description |
| ---- | ----------- |
| assertion | Test assertion task. |
| branch | Run different tasks based on a condition. |
| calculate_cosine_similarity | Calculate a cosine similarity between two tensors. |
| download_azure_blob | Download a blob from Azure Blob Storage. |
| get_current_time | Get the current time. |
| get_dataset_split | Split a dataset into two. |
| get_dataset_stats | Get statistics of a dataset. |
| get_dataset_subset | Get a subset of a dataset. |
| get_fake_image_classification_dataset | Create a fake image classification dataset for testing. |
| get_fake_object_detection_dataset | Create a fake object detection dataset for testing. |
| get_item | Get an element from a list. |
| get_secret_from_azure_keyvault | Get a secret value from Azure KeyVault. |
| get_topk | Get TopK values from a tensor. |
| join_filepath | Join path components. |
| load_state_dict | Load a state_dict into a PyTorch model. |
| print_environment_info | Print information about current environment. |
| run_parallel | Run multiple tasks in parallel. |
| run_sequential | Run multiple tasks sequentially. |
| save_file | Save a binary as a file. |
| save_state_dict | Save a state_dict of a model as a file. |
| search_grid_sequential | Grid search hyperparameters. |
| upload_azure_blob | Upload a blob to Azure Blob Storage. |

## [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)
This package contains tasks related to PyTorch training.
| Task | Description |
| ---- | ----------- |
| append_classifier | Append a classifier layer to a model. |
| benchmark_model | Get forward/backward pass speed of a model. |
| build_classification_prompt_dataset | Create a text dataset using a prompt generator. |
| build_zero_shot_classifier | Generate a zero-shot classification layer. |
| create_classification_prompt_generator | Create a prompt generator for the classification task. |
| evaluate_accuracy | Evaluate accuracy of prediction results. |
| evaluate_detection_average_precision | Evaluate Average Precision of prediction results for the Object Detection task. |
| export_onnx | Export a PyTorch model as ONNX. |
| get_targets_from_dataset | Extract targets from a dataset. |
| make_feature_extractor_model | Make a model to extract feature vectors from a model. |
| make_image_text_contrastive_model | Make an image-text contrastive model. |
| make_image_text_transform | Make a preprocessing function for image-text dataset. |
| make_oversampled_dataset | Make a new dataset by oversampling a dataset. |
| predict | Run inference. |
| split_image_text_model | Split an image-text model into an image model and a text model. |
| train | Train a model. |
| train_with_gradient_cache | Train a model with gradient caching. |

## [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
Adapter tasks for the torchvision library.
| Task | Description |
| ---- | ----------- |
| create_torchvision_model | Create a model using the torchvision library. |
| create_torchvision_transform | Create a preprocessing function using the torchvision library. |
| load_torchvision_dataset | Load a dataset from the torchvision library. |

## [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
Adapter tasks for the HuggingFace transformers library.
| Task | Description |
| ---- | ----------- |
| create_transformers_model | Create a model using the transformers library. |
| create_transformers_tokenizer | Create a tokenizer using the transformers library. |

## [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
Adapter tasks for models in the timm library.
| Task | Description |
| ---- | ----------- |
| create_timm_model | Create a model using the timm library. |
| create_timm_transform | Create a preprocessing function using the timm library. |

## [irisml-tasks-onnx](https://github.com/microsoft/irisml-tasks-onnx)
Adapter tasks for the OnnxRuntime library.
| Task | Description |
| ---- | ----------- |
| predict_onnx | Run inference for an ONNX model. |

## [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
| Task | Description |
| ---- | ----------- |
| run_azureml_child | Run tasks as a new child AzureML Run. |
| add_aml_tag | Tag the AML Run with a string key and optional value. |

## [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
| Task | Description |
| ---- | ----------- |
| launch_fiftyone | Launch a fiftyone interface. |

# Development
## Create a new task
To create a Task, you must define a module that contains a "Task" class. Here is a simple example:
```python
# irisml/tasks/my_custom_task.py
import dataclasses
import irisml.core


@dataclasses.dataclass
class ChildConfig:  # Illustrative nested config. The class and field names are placeholders.
  int_option: int = 0


class Task(irisml.core.TaskBase):  # The class name must be "Task".
  VERSION = '1.0.0'
  CACHE_ENABLED = True  # (default: True) This is optional.

  @dataclasses.dataclass
  class Inputs:  # You can remove this class if the task doesn't require inputs.
    int_value: int
    float_value: float

  @dataclasses.dataclass
  class Config:  # If there is no configuration, you can remove this class. All fields must be JSON-serializable.
    another_float: float
    child_dataclass: ChildConfig  # If you'd like to define a nested config, you can define another dataclass.

  @dataclasses.dataclass
  class Outputs:  # Can be removed if the task doesn't have outputs.
    float_value: float = 0  # If dry_run() is not implemented, Outputs fields must have default value or default factory.

  def execute(self, inputs: Inputs) -> Outputs:
    return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)

  def dry_run(self, inputs: Inputs) -> Outputs:  # This method is optional.
    return self.Outputs(0)  # Must return immediately without actual processing.
```

Each Task must define an "execute" method. The base class has empty implementations for Inputs, Config, Outputs, and dry_run(). For details, please see the documentation for the TaskBase class.
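
The relationship between execute(), dry_run(), and the Outputs defaults can be tried out without installing IrisML by using a stand-in base class. Everything below is a sketch under assumptions: `StandInTaskBase` and `MultiplyTask` are hypothetical names, and both the constructor signature and the default dry_run() behavior are guesses, not the documented TaskBase API.

```python
import dataclasses


class StandInTaskBase:
    """Stand-in for irisml.core.TaskBase, written only for this illustration."""

    @dataclasses.dataclass
    class Outputs:
        pass

    def __init__(self, config=None):
        self.config = config

    def dry_run(self, inputs):
        # Assumed default: build Outputs with no arguments. This only works
        # when every Outputs field has a default value or default factory.
        return self.Outputs()


class MultiplyTask(StandInTaskBase):
    @dataclasses.dataclass
    class Inputs:
        int_value: int
        float_value: float

    @dataclasses.dataclass
    class Config:
        another_float: float

    @dataclasses.dataclass
    class Outputs:
        float_value: float = 0

    def execute(self, inputs):
        return self.Outputs(inputs.int_value * inputs.float_value * self.config.another_float)


task = MultiplyTask(MultiplyTask.Config(another_float=2.0))
task_inputs = MultiplyTask.Inputs(int_value=3, float_value=1.5)
result = task.execute(task_inputs)      # Real computation.
dry_result = task.dry_run(task_inputs)  # Falls back to the Outputs defaults.
print(result.float_value, dry_result.float_value)
```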

# Related repositories
- [irisml-tasks](https://github.com/microsoft/irisml-tasks)
- [irisml-tasks-training](https://github.com/microsoft/irisml-tasks-training)
- [irisml-tasks-torchvision](https://github.com/microsoft/irisml-tasks-torchvision)
- [irisml-tasks-transformers](https://github.com/microsoft/irisml-tasks-transformers)
- [irisml-tasks-timm](https://github.com/microsoft/irisml-tasks-timm)
- [irisml-tasks-onnx](https://github.com/microsoft/irisml-tasks-onnx)
- [irisml-tasks-azureml](https://github.com/microsoft/irisml-tasks-azureml)
- [irisml-tasks-fiftyone](https://github.com/microsoft/irisml-tasks-fiftyone)
