Metadata-Version: 2.1
Name: arabica
Version: 1.0.4
Summary: Python package for exploratory text data analysis
Home-page: https://github.com/PetrKorab/Arabica
Author: Petr Koráb
Author-email: Petr Korab <xpetrkorab@gmail.com>
License: MIT License
        
Project-URL: Homepage, https://github.com/PetrKorab/Arabica
Project-URL: Bug Tracker, https://github.com/PetrKorab/Arabica/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Arabica
**Python package for exploratory text data analysis**

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Arabica provides functions to make the exploratory analysis of such datasets simple.


Arabica provides these methods:

* **arabica_freq**: calculates unigram, bigram, and trigram frequencies over a period (year, month, day)

It can apply all or a selected combination of the following cleaning operations:

* Remove digits from the text
* Remove punctuation from the text
* Remove standard list of stopwords
* Remove an additional specific list of words

Arabica works with **texts** of languages based on the Latin alphabet, uses `clean-text` for punctuation cleaning, and enables stop words removal for languages in the `NLTK` corpus of stopwords.

It reads **dates** in standard date and datetime formats (e.g., 2013–12–31, 2013/12/31, 09-Feb-2009, 2013–12–31 11:46:17, 09/02/2009 09:26).
It is preferable to use the US-style dates (MM/DD/YYYY) rather than the European-style date format (DD/MM/YYYY).

## Installation

Arabica requires [Python 3](https://www.python.org/downloads/),
[NLTK](http://www.nltk.org/install.html),
[clean-text](https://pypi.org/project/cleantext/#description), and
[numpy](https://pypi.org/project/numpy/) to execute. To install using pip, use:

`pip install arabica`



## Usage

* **Import the library**:


``` python
from arabica import arabica_freq
```



* **Choose a method:**

**arabica_freq** returns a dataframe with aggregated unigrams, bigrams, and trigrams frequencies over a period.
To remove stopwords, select aggregation period, and choose a specific set of cleaning operations:

``` python
def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Strings to be skipped
                 punct: bool = False,      # Remove all punctuation
                 lower_case: bool = False, # Make all text lowercase before n-gram calculation
                 max_words: int ='',       # Max number for unigrams, bigrams and trigrams displayed
                 time_freq: str ='',       # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 numbers: bool = False     # Remove all digits
) 
```

A list of available languages for stopwords is printed with:
``` python
from nltk.corpus import stopwords
print(stopwords.fileids())
```

It is possible to remove more sets of stopwords at once by `stopwords = ['language 1', 'language2','etc..']`

## Examples

### Time-series n-gram analysis

Returns a table with unigram, bigram, and trigram frequencies over a period of time.


``` python
import pandas as pd
from arabica import arabica_freq
```


``` python
data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/ grrrrr...',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8','2014-10-8']})
```

``` python
arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'M',           # Calculates monthly n-gram frequencies
             max_words = 2,             # Displays only the first two most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'],   # Removes English set of stopwords
             skip = ['grrrrr'],         # Excludes string from n-gram calculation
             numbers = True,            # Removes numbers
             punct = True,              # Removes punctuation
             lower_case = True)         # Makes all text lowercase before n-gram calculation     
``` 

### Descriptive n-gram analysis

Returns unigram, bigram, and trigram frequencies without period aggregation.

``` python
arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'ungroup',        # No aggregation made
             max_words = 2,
             stopwords = ['english'],
             skip = ['grrrrr'],       
             numbers = True,
             punct = True
             lower_case = True)
``` 

## Documentation and tutorials

Read the documentation [here](https://arabica.readthedocs.io/en/latest/index.html). For more examples of coding, read this tutorial:

**Text as Time Series: Arabica 1.0.0 Brings New Features for Exploratory Text Data Analysis** [here](https://towardsdatascience.com/text-as-time-series-arabica-1-0-brings-new-features-for-exploratory-text-data-analysis-88eaabb84deb?sk=229ec0602d0b8514f25bce501ed9ecb9)

## License

##### MIT

For any questions, issues, bugs, and suggestions, please visit [here](https://github.com/PetrKorab/arabica/issues).
