Metadata-Version: 2.1
Name: colmet
Version: 0.6.6
Summary: A utility to monitor job resources in an HPC environment, especially OAR
Home-page: http://oar.imag.fr/
Author: Philippe Le Brouster, Olivier Richard
Author-email: philippe.le-brouster@imag.fr, olivier.richard@imag.fr
Maintainer: Salem Harrache
Maintainer-email: salem.harrache@inria.fr
License: GNU GPL
Description: # Colmet - Collecting metrics about jobs running in a distributed environment
        
        ## Introduction:
        
        Colmet is a monitoring tool that collects metrics about jobs running in a
        distributed environment, especially on clusters and grids. It currently
        provides several backends:
        - Input backends:
          - taskstats: fetch task metrics from the Linux kernel
          - rapl: real-time power consumption metrics for Intel processors
          - perfhw: perf_event counters
          - jobproc: get info from /proc
          - ipmipower: get power metrics from IPMI
          - temperature: get temperatures from /sys/class/thermal
          - infiniband: get InfiniBand/Omni-Path network metrics
          - lustre: get Lustre FS stats
        - Output backends:
          - elasticsearch: store the metrics in Elasticsearch indices
          - hdf5: store the metrics on the filesystem
          - stdout: display the metrics on the terminal
        
        It uses zeromq to transport the metrics across the network.
        
        It is currently bound to the [OAR](http://oar.imag.fr) RJMS.
        
        A Grafana [sample dashboard](./graph/grafana) is provided for the elasticsearch backend. Here are some snapshots:
        
        ![](./screenshot1.png)
        
        ![](./screenshot2.png)
        
        ## Installation:
        
        ### Requirements
        
        - a Linux kernel that supports
          - Taskstats
          - intel_rapl (for RAPL backend)
          - perf_event (for perfhw backend)
          - ipmi_devintf (for ipmi backend)
        
        - Python 2.7 or newer
          - python-zmq 2.2.0 or newer
          - python-tables 3.3.0 or newer
          - python-pyinotify 0.9.3-2 or newer
          - python-requests
        
        - For the Elasticsearch output backend (recommended for sites with > 50 nodes)
          - An Elasticsearch server
          - A Grafana server (for visualization)
        
        - For the RAPL input backend:
          - libpowercap, powercap-utils (https://github.com/powercap/powercap)
        
        - For the infiniband backend:
          - `perfquery` command line tool
        
        - for the ipmipower backend:
          - `ipmi-oem` command line tool (freeipmi) or other configurable command
        
        ### Installation
        
        You can install, upgrade, or uninstall colmet with these commands:
        
        ```
        $ pip install [--user] colmet
        $ pip install [--user] --upgrade colmet
        $ pip uninstall colmet
        ```
        
        Or from git (latest development version):
        
        ```
        $ pip install [--user] git+https://github.com/oar-team/colmet.git
        ```
        
        Or if you have already pulled the sources:
        
        ```
        $ pip install [--user] path/to/sources
        ```
        
        ### Usage:
        
        For the nodes:
        
        ```
        sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
        ```
        
        For the collector:
        
        ```
        # Simple local HDF5 file collect:
        colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
        ```
        
        ```
        # Collector with an Elasticsearch backend:
        colmet-collector -vvv \
          --zeromq-bind-uri tcp://192.168.0.1:5556 \
          --buffer-size 5000 \
          --sample-period 3 \
          --elastic-host http://192.168.0.2:9200 \
          --elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
        ```
        
        You will see the number of counters retrieved in the debug log.
        
        
        For more information, please refer to the help of these scripts (`--help`).
        
        ### Notes about backends
        
        Some input backends may need external libraries that need to be previously compiled and installed:
        
        ```
        # For the perfhw backend:
        cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
        # For the rapl backend:
        cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
        ```
        
        Here's a complete colmet-node start-up process, with perfhw, rapl and more backends:
        
        ```
        export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
        export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so
        
        colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
           --cpuset_rootpath /dev/cpuset/oar \
           --enable-infiniband --omnipath \
           --enable-lustre \
           --enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
           --enable-RAPL \
           --enable-jobproc \
           --enable-ipmipower >> /var/log/colmet.log 2>&1
        ```
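
        Before launching, it may help to check that the shared libraries exported above are actually present. A minimal sketch in Python (the variable names and default paths mirror the start-up example and are illustrative assumptions, not part of colmet itself):

        ```
        import os

        # Env variables used in the start-up example above; the default
        # fallback paths are illustrative assumptions.
        REQUIRED_LIBS = {
            "LIB_PERFHW_PATH": "/usr/local/lib/lib_perf_hw.so",
            "LIB_RAPL_PATH": "/usr/local/lib/lib_rapl.so",
        }

        def check_backend_libs(env=os.environ):
            """Return (variable, path) pairs whose library file is missing."""
            missing = []
            for var, default in REQUIRED_LIBS.items():
                path = env.get(var, default)
                if not os.path.isfile(path):
                    missing.append((var, path))
            return missing

        for var, path in check_backend_libs():
            print("warning: %s -> %s not found" % (var, path))
        ```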
        
        #### RAPL - Running Average Power Limit (Intel)
        
        RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.
        
        Usage: start colmet-node with the option `--enable-RAPL`

        A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. from the collected data and the actual name of the metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.

        If a given counter is not supported by the hardware, the metric name will be "`counter_not_supported_by_hardware`" and `0` values will appear in the collected data; `-1` values in the collected data mean there is no counter mapped to the column.
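
        For post-processing, the mapping file can be joined with the collected data. Here is a sketch that assumes a simple two-column CSV layout (counter name, metric name); check an actual mapping file for the exact format:

        ```
        import csv
        import io

        def load_rapl_mapping(fileobj):
            """Parse a RAPL_mapping.[timestamp].csv file into {counter: metric}.

            The two-column layout assumed here is for illustration only.
            """
            mapping = {}
            for row in csv.reader(fileobj):
                if len(row) < 2:
                    continue
                counter, metric = row[0].strip(), row[1].strip()
                if metric == "counter_not_supported_by_hardware":
                    continue  # values for this counter will be 0 in the data
                mapping[counter] = metric
            return mapping

        # Example with a synthetic mapping file:
        sample = io.StringIO(
            "counter_1,package_0_core\n"
            "counter_2,package_0_dram\n"
            "counter_3,counter_not_supported_by_hardware\n"
        )
        print(load_rapl_mapping(sample))
        # -> {'counter_1': 'package_0_core', 'counter_2': 'package_0_dram'}
        ```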
        
        #### Perfhw
        
        This backend provides metrics collected using the [perf_event_open](http://man7.org/linux/man-pages/man2/perf_event_open.2.html) interface.
        
        Usage: start colmet-node with the option `--enable-perfhw`

        Optionally, choose the metrics you want (max 5 metrics) using the `--perfhw-list` option followed by a space-separated list of metrics.

        Example: `--enable-perfhw --perfhw-list instructions cpu_cycles cache_misses`

        A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. from the collected data and the actual name of the metric.
        
        Available metrics (refer to the perf_event_open documentation for their meaning):
        
        ```
        cpu_cycles
        instructions
        cache_references
        cache_misses
        branch_instructions
        branch_misses
        bus_cycles
        ref_cpu_cycles
        cache_l1d
        cache_ll
        cache_dtlb
        cache_itlb
        cache_bpu
        cache_node
        cache_op_read
        cache_op_prefetch
        cache_result_access
        cpu_clock
        task_clock
        page_faults
        context_switches
        cpu_migrations
        page_faults_min
        page_faults_maj
        alignment_faults
        emulation_faults
        dummy
        bpf_output
        ```
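
        Since `--perfhw-list` accepts at most 5 metrics, a selection can be sanity-checked before launching colmet-node. A small helper using the names above (illustrative, not part of colmet):

        ```
        # Metric names from the list above.
        PERFHW_METRICS = {
            "cpu_cycles", "instructions", "cache_references", "cache_misses",
            "branch_instructions", "branch_misses", "bus_cycles", "ref_cpu_cycles",
            "cache_l1d", "cache_ll", "cache_dtlb", "cache_itlb", "cache_bpu",
            "cache_node", "cache_op_read", "cache_op_prefetch", "cache_result_access",
            "cpu_clock", "task_clock", "page_faults", "context_switches",
            "cpu_migrations", "page_faults_min", "page_faults_maj",
            "alignment_faults", "emulation_faults", "dummy", "bpf_output",
        }

        def validate_perfhw_list(metrics):
            """Raise ValueError if the list is too long or has unknown names."""
            if len(metrics) > 5:
                raise ValueError("--perfhw-list accepts at most 5 metrics")
            unknown = [m for m in metrics if m not in PERFHW_METRICS]
            if unknown:
                raise ValueError("unknown perfhw metrics: %s" % unknown)
            return metrics

        validate_perfhw_list(["instructions", "cpu_cycles", "cache_misses"])  # ok
        ```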
        
        #### Temperature
        
        This backend gets temperatures from `/sys/class/thermal/thermal_zone*/temp`.

        Usage: start colmet-node with the option `--enable-temperature`

        A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. from the collected data and the actual name of the metric.
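
        The same values can be read by hand for a quick check. A sketch of what this backend does (the kernel exposes millidegrees Celsius in the `temp` files):

        ```
        import glob
        import os

        def read_temperatures(base="/sys/class/thermal"):
            """Return {zone_name: temperature in Celsius} for readable zones."""
            temps = {}
            for zone in glob.glob(os.path.join(base, "thermal_zone*")):
                try:
                    with open(os.path.join(zone, "temp")) as f:
                        millideg = int(f.read().strip())
                except (OSError, ValueError):
                    continue  # zone without a readable temperature value
                temps[os.path.basename(zone)] = millideg / 1000.0
            return temps

        print(read_temperatures())
        ```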
        
        
        
        Colmet CHANGELOG
        ================
        
        Version 0.6.6
        -------------
        - Added --no-check-certificates option for elastic backend
        - Added involved jobs and new metrics into jobprocstats
        
        Version 0.6.4
        -------------
        
        - Added http auth support for elasticsearch backend
        
        
        Version 0.6.3
        -------------
        
        Released on September 4th 2020
        
        - Bugfixes into lustrestats and jobprocstats backend
        
        Version 0.6.2
        -------------
        
        Released on September 3rd 2020
        
        - Python package fix
        
        Version 0.6.1
        -------------
        
        Released on September 3rd 2020
        
        - New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
        - New output backend: elasticsearch
        - Example Grafana dashboard for Elasticsearch backend
        - Added "involved_jobs" value for metrics that are global to a node (job 0)
        - Bugfix for "dictionnary changed size during iteration"
        
        Version 0.5.4
        -------------
        
        Released on January 19th 2018
        
        - hdf5 extractor script for OAR RESTFUL API
        - Added infiniband backend
        - Added lustre backend
        - Fixed cpuset_rootpath default always appended
        
        Version 0.5.3
        -------------
        
        Released on April 29th 2015
        
        - Removed an unnecessary lock from the collector to avoid colmet waiting forever
        - Removed (async) zmq eventloop and added ``--sample-period`` to the collector.
        - Fixed some bugs about hdf file
        
        Version 0.5.2
        -------------
        
        Released on Apr 2nd 2015
        
        - Fixed python syntax error
        
        
        Version 0.5.1
        -------------
        
        Released on Apr 2nd 2015
        
        - Fixed error about missing ``requirements.txt`` file in the sdist package
        
        
        Version 0.5.0
        -------------
        
        Released on Apr 2nd 2015
        
        - Don't run colmet as a daemon anymore
        - Maintained compatibility with zmq 3.x/4.x
         - Dropped ``--zeromq-swap`` (swap was dropped from zmq 3.x)
         - Handled zmq name change from HWM to SNDHWM and RCVHWM
        - Fixed requirements
        - Dropped python 2.6 support
        
        Version 0.4.0
        -------------
        
        - Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
        - Handled HUP signal to reload ``colmet-collector``
        - Removed ``hiwater_rss`` and ``hiwater_vm`` collected metrics.
        
        
        Version 0.3.1
        -------------
        
        - New metrics ``hiwater_rss`` and ``hiwater_vm`` for taskstats
        - Worked with pyinotify 0.8
        - Added ``--disable-procstats`` option to disable procstats backend.
        
        
        Version 0.3.0
        -------------
        
        - Divided colmet package into three parts
        
          - colmet-node : Retrieve data from taskstats and procstats and send to
            collectors with ZeroMQ
          - colmet-collector : A collector that stores data received by ZeroMQ in a
            hdf5 file
          - colmet-common : Common colmet part.
        - Added some parameters of ZeroMQ backend to prevent a memory overflow
        - Simplified the command line interface
        - Dropped rrd backend because it is not yet working
        - Added ``--buffer-size`` option for collector to define the maximum number of
          counters that colmet should queue in memory before pushing it to output
          backend
        - Handled SIGTERM and SIGINT to terminate colmet properly
        
        Version 0.2.0
        -------------
        
        - Added options to enable hdf5 compression
        - Support for multiple jobs by cgroup path scanning
        - Used Inotify events for job list update
        - Don't filter packets if no job_id range was specified, especially with zeromq
          backend
        - Waited for the cgroup_path folder to be created before scanning the list of jobs
        - Added procstat for node monitoring through a fictive job with 0 as identifier
        - Used absolute measurement times rather than the delay between measurements, to
          avoid drift of the measurement time
        - Added a workaround for when a newly created cgroup has no process in it
          (monitoring is suspended until a process is launched)
        
        
        Version 0.0.1
        -------------
        
        - Conception
        
Keywords: monitoring,taskstat,oar,hpc,sciences
Platform: Linux
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Clustering
Classifier: Programming Language :: Python :: 3.5
Description-Content-Type: text/markdown
