Metadata-Version: 2.1
Name: feast-hive
Version: 0.17.0
Summary: Hive support for Feast offline store
Home-page: https://github.com/baineng/feast-hive
Author: Benn Ma
Author-email: bennmsg@gmail.com
License: Apache License, Version 2.0
Project-URL: Bug Tracker, https://github.com/baineng/feast-hive/issues
Keywords: feast featurestore hive offlinestore
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
Provides-Extra: dev
Provides-Extra: test
License-File: LICENSE

# Feast Hive Support

Hive is not included in current [Feast](https://github.com/feast-dev/feast) roadmap, this project intends to add Hive support for Offline Store.  
For more details, can check [this Feast issue](https://github.com/feast-dev/feast/issues/1686).

**The public releases have passed all integration tests, please create an issue if you got any problem.**

## Change Logs
- DONE [v0.1.1] ~~I am working on the first workable version, think it will be released in a couple of days.~~
- DONE [v0.1.2] ~~Allow custom hive conf when connect to a HiveServer2~~
- DONE [v0.14.0] ~~Support Feast 0.14.x~~
- DONE [v0.17.0] ~~Support Feast 0.17.0~~
- TODO It currently supports `insert into` for uploading entity_df, which is a little inefficient, gonna add extra parameters for people who are able to provide HDFS address in next version (for uploading to HDFS). 

## Quickstart

#### Install feast

```shell
pip install feast
```

#### Install feast-hive

- Install stable version

```shell
pip install feast-hive 
```

- Install develop version (not stable):

```shell
pip install git+https://github.com/baineng/feast-hive.git 
```

#### Create a feature repository

```shell
feast init feature_repo
cd feature_repo
```

#### Edit `feature_store.yaml`

set `offline_store` type to be `feast_hive.HiveOfflineStore`

```yaml
project: ...
registry: ...
provider: local
offline_store:
    type: feast_hive.HiveOfflineStore
    host: localhost
    port: 10000        # optional, default is `10000`
    database: default  # optional, default is `default`
    hive_conf:         # optional, hive conf overlay
      hive.join.cache.size: 14797
      hive.exec.max.dynamic.partitions: 779
    ... # other parameters
online_store:
    ...
```

#### Create Hive Table

1. Upload `data/driver_stats.parquet` to HDFS
```shell
hdfs dfs -copyFromLocal ./data/driver_stats.parquet /tmp/
```
2. Create Hive Table
```sql
CREATE TABLE driver_stats (
    event_timestamp   bigint,
    driver_id         bigint,
    conv_rate         float,
    acc_rate          float,
    avg_daily_trips   int,
    created           bigint
)
STORED AS PARQUET;
```
3. Load data into the table
```sql
LOAD DATA INPATH '/tmp/driver_stats.parquet' INTO TABLE driver_stats;
```

#### Edit `example.py`

```python
# This is an example feature definition file

from google.protobuf.duration_pb2 import Duration

from feast import Entity, Feature, FeatureView, ValueType
from feast_hive import HiveSource

# Read data from Hive table
# Here we use a Query to reuse the original parquet data, 
# but you can replace to your own Table or Query.
driver_hourly_stats = HiveSource(
    # table='driver_stats',
    query = """
    SELECT Timestamp(cast(event_timestamp / 1000000 as bigint)) AS event_timestamp, 
           driver_id, conv_rate, acc_rate, avg_daily_trips, 
           Timestamp(cast(created / 1000000 as bigint)) AS created 
    FROM driver_stats
    """,
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver.
driver = Entity(name="driver_id", value_type=ValueType.INT64, description="driver id", )

# Define FeatureView
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=Duration(seconds=86400 * 1),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
        Feature(name="avg_daily_trips", dtype=ValueType.INT64),
    ],
    online=True,
    batch_source=driver_hourly_stats,
    tags={},
)
```

#### Apply the feature definitions

```shell
feast apply
```

#### Generating training data and so on

The rest are as same as [Feast Quickstart](https://docs.feast.dev/quickstart#generating-training-data)


## Developing and Testing

#### Developing

```shell
git clone https://github.com/baineng/feast-hive.git
cd feast-hive
# creating virtual env ...
pip install -e ".[dev]"

# before commit
make format
make lint
```

#### Testing

```shell
pip install -e ".[test]"
pytest -n 6 --host=localhost --port=10000 --database=default
```

