Metadata-Version: 2.1
Name: pipelime
Version: 0.1.3
Summary: data pipeline 101
Home-page: https://github.com/eyecan-ai/pipelime
Author: daniele de gregorio
Author-email: daniele.degregorio@eyecan.ai
License: GNU General Public License v3
Keywords: pipelime
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5
Provides-Extra: minio
License-File: LICENSE

===========
🍋 Pipelime
===========

Data Pipeline 101

.. image:: https://img.shields.io/pypi/v/pipelime.svg
        :target: https://pypi.python.org/pypi/pipelime

.. image:: https://img.shields.io/travis/eyecan-ai/pipelime.svg
        :target: https://travis-ci.com/eyecan-ai/pipelime

.. image:: https://readthedocs.org/projects/pipelime/badge/?version=latest
        :target: https://pipelime.readthedocs.io/en/latest/?badge=latest
        :alt: Documentation Status

.. image:: https://pyup.io/repos/github/eyecan-ai/pipelime/shield.svg
     :target: https://pyup.io/repos/github/eyecan-ai/pipelime/
     :alt: Updates

Installation
============

.. code-block:: bash

        pip install pipelime

Basic Usage
===========

Underfolder Format
------------------

The **Underfolder** format is one of the pipelime dataset formats: a flexible way to
model and store a generic dataset on the **filesystem**.

.. image:: docs/images/underfolder.png
  :width: 400
  :align: center
  :alt: underfolder structure

An Underfolder **dataset** is a collection of samples. A **sample** is a collection of items.
An **item** is a unitary block of data: it can be a generic tensor (e.g. a multi-channel image
or a plain matrix), a dictionary, or more.

Underfolder datasets must contain a subfolder named ``data`` that holds the samples and
their items. Optionally, you can store items directly in the root folder: they
act as “global” items injected into each sample.

.. image:: docs/images/naming.png
  :width: 400
  :align: center
  :alt: naming convention

Items are named using the following naming convention:

.. code-block:: bash

        $ID_$ITEM.$EXT

Where:

* ``$ID`` is the sample identifier, which must be a unique string for each sample.
* ``$ITEM`` is the item name.
* ``$EXT`` is the item extension. Currently supported extensions are:
  
  * The most common image formats like PNG, JPEG, BMP, and many others...
  * YAML and JSON for dictionary-like objects.
  * TXT for numpy 2D matrix notation.
  * NPY and NPZ for numpy arrays.
  * PKL for generic picklable python objects.

Root files follow the same convention but they lack the sample identifier part:

.. code-block:: bash

        $ITEM.$EXT
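
The naming convention above can be parsed with a few lines of standard-library Python.
The following is a minimal sketch, not pipelime's own parser: the helper names
``parse_item_name`` and ``group_items`` are hypothetical, and the code assumes that sample
identifiers contain no underscore, so that the first underscore separates ``$ID`` from ``$ITEM``.

.. code-block:: python

        from pathlib import Path


        def parse_item_name(filename):
            """Split '$ID_$ITEM.$EXT' into (sample_id, item_name, extension)."""
            path = Path(filename)
            # Assumption: the sample id has no underscore, so split at the first one
            sample_id, _, item = path.stem.partition('_')
            return sample_id, item, path.suffix.lstrip('.')


        def group_items(data_files, root_files=()):
            """Group file names by sample id and inject root ("global") items."""
            # Root files have no sample id: the file stem is the item name
            global_items = {Path(f).stem: f for f in root_files}
            samples = {}
            for filename in data_files:
                sample_id, item, _ = parse_item_name(filename)
                # Every sample starts from a copy of the global items
                samples.setdefault(sample_id, dict(global_items))[item] = filename
            return samples


        samples = group_items(
            ['000_image.png', '000_image_mask.png', '001_image.png'],
            root_files=['cfg.yml'],
        )
        print(sorted(samples))         # ['000', '001']
        print(sorted(samples['000']))  # ['cfg', 'image', 'image_mask']

Splitting at the first underscore keeps item names such as ``image_mask`` intact.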

Reading an Underfolder Dataset
------------------------------

Pipelime provides an intuitive interface to read, manipulate and write Underfolder Datasets.
You don't have to memorize complex signatures, instantiate weird object iterators, or write
tens of lines of boilerplate code. It all boils down to a **reader**, a **writer** and objects that 
behave like built-in python types such as **lists** and **dictionaries**.

.. code-block:: python

        from pipelime.sequences.readers.filesystem import UnderfolderReader

        # Read an underfolder dataset with a single line of code
        dataset = UnderfolderReader('tests/sample_data/datasets/underfolder_minimnist')

        # A dataset behaves like a Sequence
        len(dataset) # The number of samples (20)
        sample = dataset[4] # Get a Sample from the dataset

        # A Sample is a MutableMapping
        len(sample) # The number of items (10)
        set(sample.keys()) # The set of all the item names {'cfg', 'image', 'image_mask', ...}
        item = sample['image'] # Get an item from the sample

        # An item can be any python object, depending on which extension is used to store it.
        type(item) # numpy.ndarray
        item.shape # (28, 28, 3)
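
Because a dataset behaves like a sequence of mappings, ordinary Python idioms apply to it.
As a minimal sketch, the hypothetical helper ``count_items`` below counts how many samples
contain each item name. A plain list of dictionaries stands in for the reader so that the
example runs without a dataset on disk; passing an ``UnderfolderReader`` should work the same
way, assuming it supports iteration like any other sequence.

.. code-block:: python

        from collections import Counter


        def count_items(dataset):
            """Count how many samples contain each item name."""
            counts = Counter()
            for sample in dataset:
                # Only the mapping interface of a sample is used here
                counts.update(sample.keys())
            return counts


        # Stand-in for a real dataset: a list of dict-like samples
        fake_dataset = [
            {'image': 'img0', 'label': 0},
            {'image': 'img1', 'label': 1, 'pose': 'p1'},
        ]
        counts = count_items(fake_dataset)
        print(counts['image'], counts['pose'])  # 2 1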

Writing an Underfolder Dataset
------------------------------

You can **write** a dataset by simply creating and running a writer object.

.. code-block:: python

        from pipelime.sequences.writers.filesystem import UnderfolderWriter

        # Create the writer object from a destination path
        writer = UnderfolderWriter('/tmp/my_output_dataset')
        # Write the dataset to file system
        writer(dataset)

By default, ``UnderfolderWriter`` saves every item with the extension it was originally read with.
If for any reason it is unable to retrieve the original extension, it will use **pickle** to
serialize the object.

If you don't want to use pickle, you can choose a **custom extension** for each item name.
You can also choose which items are going to be saved as **root files** (if the contained data 
is the same for all samples).

.. code-block:: python

        from pipelime.sequences.writers.filesystem import UnderfolderWriter

        # These items are going to be saved as root files
        root_files = ['cfg', 'numbers', 'pose']

        # Associate a custom extension to each item name
        extensions = {
                'image': 'jpg',
                'image_mask': 'png',
                'image_maskinv': 'png',
                'label': 'txt',
                'metadata': 'json',
                'metadatay': 'yml',
                'points': 'txt',
                'numbers': 'txt',
                'pose': 'txt',
                'cfg': 'yml'
        }

        # Create a customized writer object
        writer = UnderfolderWriter(
                '/tmp/my_output_dataset',
                root_files_keys=root_files,
                extensions_map=extensions,
        )
        # Write the dataset to file system
        writer(dataset)
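
To make the effect of these two options concrete, the sketch below computes where an item
would be expected to land, given the layout and naming rules of the Underfolder format. It is
an illustration only, not the logic used by ``UnderfolderWriter``: the function ``target_path``
is hypothetical, and it ignores the "original extension" default, falling back to pickle
whenever an item name is missing from the extensions map.

.. code-block:: python

        def target_path(sample_id, item, extensions=None, root_files=()):
            """Return the relative path expected for an item (illustrative only)."""
            # Assumption: fall back to pickle when no extension is given
            ext = (extensions or {}).get(item, 'pkl')
            if item in root_files:
                # Root files live in the dataset root and carry no sample id
                return '{}.{}'.format(item, ext)
            # Regular items live in the ``data`` subfolder
            return 'data/{}_{}.{}'.format(sample_id, item, ext)


        extensions = {'image': 'jpg', 'cfg': 'yml'}
        print(target_path('000', 'image', extensions, ['cfg']))  # data/000_image.jpg
        print(target_path('000', 'cfg', extensions, ['cfg']))    # cfg.yml
        print(target_path('000', 'points'))                      # data/000_points.pkl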

