Metadata-Version: 2.1
Name: hte
Version: 0.0.24
Summary: Extracting content from specific address books
Author: Eirik Berger
Author-email: eirik.berger@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Description-Content-Type: text/markdown
License-File: LICENSE

# historical-text-extraction (hte)
[![PyPI version](https://badge.fury.io/py/hte.svg)](https://badge.fury.io/py/hte)

Package to extract text from historical documents. The package is written for personal use. 

## Installation

Install the current release from PyPI:

``` bash
pip install hte
```

Install the development version from [GitHub](https://github.com/) with:

``` bash
pip install git+ssh://git@github.com/eirikberger/hte.git
```
Note that an SSH key is necessary for this approach to work.

## Using it

Import the package:

``` python
from hte import digitize
```

The basic setup is the following:

``` python
# Create a Book instance
book = digitize.Book("data/finnmark_1968.pdf", "books")

# Run the processing steps on the instance
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
```

Make sure to install the correct language package for Tesseract. 

``` bash
# Check languages already installed: 
tesseract --list-langs

# Languages available for installation
apt-cache search tesseract-ocr

# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
```
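
The same check can be done from Python before starting a long OCR run. The following is a minimal sketch and not part of `hte`: the helper names `parse_tesseract_langs` and `installed_tesseract_langs` are made up for illustration, and the sketch assumes the `tesseract` binary is on the `PATH`.

``` python
import subprocess


def parse_tesseract_langs(output: str) -> set:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def installed_tesseract_langs() -> set:
    """Run `tesseract --list-langs` and return the reported language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_tesseract_langs(result.stdout + "\n" + result.stderr)
```

Checking `"nor" in installed_tesseract_langs()` then tells you whether the Norwegian language pack is available.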

## Extracting headers

Start by converting the XML annotation files to JSON. The XML files can be created with the free software [`labelImg`](https://github.com/heartexlabs/labelImg).

``` python
import os 
os.chdir('/home/eirikb/Desktop')
```

``` python
header = Headers('train', 2022)
header.runbbxConverting()
```
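
To give an idea of what such a conversion involves, here is a minimal standalone sketch using only the standard library. It is not the implementation behind `runbbxConverting()`. It assumes the annotations were saved in `labelImg`'s Pascal VOC XML format, and the function name `voc_xml_to_dict`, the output keys, and the sample file name are made up for illustration.

``` python
import json
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout written by labelImg.
SAMPLE_XML = """
<annotation>
  <filename>page_001.png</filename>
  <object>
    <name>header</name>
    <bndbox>
      <xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>60</ymax>
    </bndbox>
  </object>
</annotation>
"""


def voc_xml_to_dict(xml_text: str) -> dict:
    """Convert one Pascal VOC annotation to a JSON-serializable dict."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        box = {"label": obj.findtext("name")}
        # Pixel coordinates of the bounding box corners.
        for key in ("xmin", "ymin", "xmax", "ymax"):
            box[key] = int(bndbox.findtext(key))
        boxes.append(box)
    return {"filename": root.findtext("filename"), "boxes": boxes}


# Serialize the parsed annotation as JSON text.
print(json.dumps(voc_xml_to_dict(SAMPLE_XML), indent=2))
```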

Then convert the JSON file to CSV.

``` python
convertFromJson('/home/eirikb/Desktop/training_xml/json-bbox', 'xml')
```

Finally, read the content of the boxes.

``` python
ReadBoxes('json-bbox/xml.csv', 'hordaland', 'train', print_images=True)
``` 
