Metadata-Version: 2.1
Name: servicex
Version: 2.0.0b2
Summary: Front-end for the ServiceX Data Server
Home-page: https://github.com/iris-hep/func_adl_xAOD
Author: G. Watts (IRIS-HEP/UW Seattle)
Author-email: gwatts@uw.edu
Maintainer: Gordon Watts (IRIS-HEP/UW Seattle)
Maintainer-email: gwatts@uw.edu
License: TBD
Description: # ServiceX_frontend
        
        Client access library for ServiceX
        
        [![GitHub Actions Status](https://github.com/ssl-hep/ServiceX_frontend/workflows/CI/CD/badge.svg)](https://github.com/ssl-hep/ServiceX_frontend/actions)
        [![Code Coverage](https://codecov.io/gh/ssl-hep/ServiceX_frontend/graph/badge.svg)](https://codecov.io/gh/ssl-hep/ServiceX_frontend)
        
        [![PyPI version](https://badge.fury.io/py/servicex.svg)](https://badge.fury.io/py/servicex)
        [![Supported Python versions](https://img.shields.io/pypi/pyversions/servicex.svg)](https://pypi.org/project/servicex/)
        
        ## Introduction
        
        Given a selection string, this library submits it to a ServiceX instance and retrieves the resulting data locally for you.
        The selection string is often generated by another front-end library, for example:
        
        - [func_adl.xAOD](https://github.com/iris-hep/func_adl_xAOD) (for ATLAS xAOD's)
        - [func_adl.uproot](https://github.com/iris-hep/func_adl.uproot) (for flat ntuples)
        - xxx for columns
        
        ## Prerequisites
        
        Before you can use this library you'll need:
        
        - An environment based on Python 3.6 or later
        - A `ServiceX` end-point. For example, `http://localhost:5000/servicex`, if `ServiceX` is running on a local Kubernetes cluster and the proper ports are open.
        
        ## Usage
        
        The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
        
        ```python
        from servicex import ServiceX
        # qastle query: the pt of every AntiKt4EMTopoJets jet, divided by 1000.0
        query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"
        dataset = "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
        # The keyword is `service_endpoint`, as in the constructor signature in the API section
        ds = ServiceX(dataset, service_endpoint='http://localhost:5000/servicex')
        r = ds.get_data_pandas_df(query)
        print(r)
        ```
        
        Running the above script in a terminal window produces the following output (it takes about 1-2 minutes to complete):
        
        ```bash
        python scripts\run_test.py http://localhost:5000/servicex
                    JetPt
        entry
        0       38.065707
        1       31.967096
        2        7.881337
        3        6.669581
        4        5.624053
        ...           ...
        710183  42.926141
        710184  30.815709
        710185   6.348002
        710186   5.472711
        710187   5.212714
        
        [11355980 rows x 1 columns]
        ```
        
        If your query is badly formed or there is another problem with the backend, an exception will be thrown with information about the error.
        
        If you'd like to be able to submit multiple queries and have them run on the `ServiceX` back end in parallel, it is best to use the `asyncio` interface, which has the identical signature, but is called `get_data_pandas_df_async`.
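        
        As a minimal sketch of what submitting several queries in parallel could look like, the example below gathers multiple `get_data_pandas_df_async` calls with `asyncio.gather`. The `StubDataset` class and the `run_queries` helper are hypothetical stand-ins (not part of this library) so that the sketch runs without a `ServiceX` end-point; with a real `ServiceX` object you would pass that object instead. `asyncio.run` requires Python 3.7 or later.
        
        ```python
        import asyncio
        
        
        async def run_queries(ds, queries):
            """Submit every query at once and wait for all of them to finish."""
            return await asyncio.gather(
                *(ds.get_data_pandas_df_async(q) for q in queries)
            )
        
        
        class StubDataset:
            """Hypothetical stand-in for a ServiceX object (assumption, offline only)."""
        
            async def get_data_pandas_df_async(self, selection_query):
                # Pretend the back end takes a moment to answer.
                await asyncio.sleep(0.01)
                return f"result for {selection_query}"
        
        
        results = asyncio.run(run_queries(StubDataset(), ["query-a", "query-b"]))
        print(results)
        ```
        
        `asyncio.gather` returns the results in the same order as the queries were passed in. Inside an environment that already runs an event loop, such as a Jupyter notebook, use `await run_queries(...)` rather than `asyncio.run`.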
        
        For documentation of `get_data` and `get_data_async` see the `servicex.py` source file.
        
        ## Features
        
        Implemented:
        
        - Accepts a `qastle` formatted query
        - Exceptions are used to report back errors of all sorts from the service to the user's code.
        - Data is returned in the following forms:
          - `pandas.DataFrame`: an in-process DataFrame of all the data requested
          - `awkward`: an in-process `JaggedArray` or dictionary of `JaggedArray`s
          - A list of root files that can be opened with `uproot` and used as desired.
          - Not all output formats are compatible with all transformations.
        - The complete returned data must fit in the process's memory
        - Runs in an async or a non-async environment; the non-async methods adapt automatically (including in `jupyter` notebooks).
        - Support up to 100 simultaneous queries from a laptop-like front end without overwhelming the local machine (hopefully ServiceX will be overwhelmed!)
        - Start downloading files as soon as they are ready (before ServiceX is done with the complete transform).
        - It has been tested to run against 100 datasets with multiple simultaneous queries.
        - It supports local caching of query data
        - It will provide feedback on progress.
        
        ## Testing
        
        This code has been tested in several environments:
        
        - Windows, Linux, MacOS
        - Python 3.6, 3.7, 3.8
        - Jupyter Notebooks (not automated), regular python command-line invoked source files
        
        ## API
        
        Everything is based around the `ServiceX` object.
        
        ```python
         |  ServiceX(dataset: str,
         |           service_endpoint: str = 'http://localhost:5000/servicex',
         |           image: str = 'sslhep/servicex_func_adl_xaod_transformer:v0.4',
         |           storage_directory: Union[str, NoneType] = None,
         |           file_name_func: Union[Callable[[str, str], pathlib.Path], NoneType] = None,
         |           max_workers: int = 20,
         |           status_callback_factory: Callable[[str], Callable[[Union[int, NoneType], int, int, int], NoneType]] = _run_default_wrapper)
         |      Create and configure a ServiceX object for a dataset.
         |
         |      Arguments
         |
         |          dataset                     Name of a dataset from which queries will be selected.
         |          service_endpoint            Where the ServiceX web API is found
         |          image                       Name of transformer image to use to transform the data
         |          storage_directory           Location to cache data that comes back from ServiceX. Data
         |                                      can be reused between invocations.
         |          file_name_func              Allows for unique naming of the files that come back.
         |                                      Rarely used.
         |          max_workers                 Maximum number of transformers to run simultaneously on
         |                                      ServiceX.
         |          status_callback_factory     Factory to create a status notification callback for each
         |                                      query. One is created per query.
         |
         |
         |      Notes:
         |
         |          -  The `status_callback` argument, by default, uses the `tqdm` library to render
         |             progress bars in a terminal window or a graphic in a Jupyter notebook (with proper
         |             jupyter extensions installed). If `status_callback` is specified as None, no
         |             updates will be rendered. A custom callback function can also be specified which
         |             takes `(total_files, transformed, downloaded, skipped)` as an argument. The
         |             `total_files` parameter may be `None` until the system knows how many files need to
         |             be processed (and some files can even be completed before that is known).
         ```
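        
        To illustrate the callback shape described in the notes, here is a hedged sketch of a custom status callback factory. The names `format_status` and `make_status_callback` are hypothetical, and the meaning of the string passed to the factory (treated here as a dataset label) is an assumption; only the `(total_files, transformed, downloaded, skipped)` argument list comes from the documentation above.
        
        ```python
        from typing import Callable, Optional
        
        # Callback shape from the constructor signature:
        # (total_files, transformed, downloaded, skipped) -> None
        StatusCallback = Callable[[Optional[int], int, int, int], None]
        
        
        def format_status(name: str, total_files: Optional[int], transformed: int,
                          downloaded: int, skipped: int) -> str:
            """Build one progress line; total_files may be None early on."""
            total = "?" if total_files is None else str(total_files)
            return (f"{name}: {transformed}/{total} transformed, "
                    f"{downloaded} downloaded, {skipped} skipped")
        
        
        def make_status_callback(name: str) -> StatusCallback:
            """Factory returning a callback that prints a line per update.
        
            Assumption: `name` identifies the dataset the query runs against.
            """
            def callback(total_files: Optional[int], transformed: int,
                         downloaded: int, skipped: int) -> None:
                print(format_status(name, total_files, transformed, downloaded, skipped))
            return callback
        
        
        # Simulated updates: the total is unknown for the first call.
        cb = make_status_callback("my-dataset")
        cb(None, 1, 0, 0)
        cb(10, 4, 3, 0)
        ```
        
        Under these assumptions the factory would be passed as `status_callback_factory=make_status_callback` when constructing the `ServiceX` object.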
        
        To get the data, use one of the `get_data` methods. They all have the same API, differing only in what they return.
        
        ```python
         |  get_data_awkward_async(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
         |      Fetch query data from ServiceX matching `selection_query` and return it as
         |      dictionary of awkward arrays, an entry for each column. The data is uniquely
         |      ordered (the same query will always return the same order).
         |
         |  get_data_awkward(self, selection_query: str) -> Dict[bytes, Union[awkward.array.jagged.JaggedArray, numpy.ndarray]]
         |      Fetch query data from ServiceX matching `selection_query` and return it as
         |      dictionary of awkward arrays, an entry for each column. The data is uniquely
         |      ordered (the same query will always return the same order).
        ```
        
        Each data type comes in a pair - an `async` version and a synchronous version.
        
        - `get_data_awkward`, `get_data_awkward_async` - Returns a dictionary of the requested data as `numpy` or `JaggedArray` objects.
        - `get_data_rootfiles`, `get_data_rootfiles_async` - Returns a list of locally downloaded files (as `pathlib.Path` objects) containing the requested data. Suitable for opening with [`ROOT::TFile`](https://root.cern.ch/doc/master/classTFile.html) or [`uproot`](https://github.com/scikit-hep/uproot).
        - `get_data_pandas_df`, `get_data_pandas_df_async` - Returns the data as a `pandas` `DataFrame`. This will fail if the data you've requested has any structure (e.g. is hierarchical, like a single entry for each event, and each event may have some number of jets).
        - `get_data_parquet`, `get_data_parquet_async` - Returns a list of files locally downloaded that can be read by any parquet tools.
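        
        The difference between structured and flat data can be sketched in plain Python. The example below uses nested lists as a stand-in for a jagged column (made-up values, not real output) and a hypothetical helper, `flatten_column`, to turn one list per event into flat `(entry, subentry, value)` rows. The `bytes` keys mirror the `Dict[bytes, ...]` return annotation of `get_data_awkward`; nothing here calls the library itself.
        
        ```python
        from typing import Dict, List, Sequence, Tuple
        
        
        def flatten_column(per_event: Sequence[Sequence[float]]) -> List[Tuple[int, int, float]]:
            """Turn per-event lists into flat (entry, subentry, value) rows."""
            return [
                (entry, sub, value)
                for entry, values in enumerate(per_event)
                for sub, value in enumerate(values)
            ]
        
        
        # Illustrative values only: three events with 2, 0, and 1 jets.
        columns: Dict[bytes, List[List[float]]] = {b"JetPt": [[38.1, 32.0], [], [7.9]]}
        
        # Decode the bytes column names and flatten each column.
        rows = {name.decode(): flatten_column(col) for name, col in columns.items()}
        print(rows)
        ```
        
        Each event can hold a different number of jets, which is why such data does not map directly onto a single flat table without an extra index like the one built here.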
        
        ## Development
        
        For any changes please feel free to submit pull requests!
        
        To do development, please set up your environment with the following steps:
        
        1. A python 3.7 development environment
        1. Fork/Pull down this package, XX
        1. `python -m pip install -e .[test]`
        1. Run the tests to make sure everything is good: `pytest`.
        
        Then add tests as you develop. When you are done, submit a pull request with any required changes to the documentation and the online tests will run.
        
Platform: Any
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: test
