Metadata-Version: 2.1
Name: epub-conversion
Version: 1.0.12
Summary: Python package for converting xml and epubs to text files
Home-page: https://github.com/JonathanRaiman/epub_conversion
Author: Jonathan Raiman
Author-email: jonathanraiman@gmail.com
License: MIT
Download-URL: https://github.com/JonathanRaiman/epub_conversion
Description: epub conversion
        ---------------
        
        Create text corpuses using epubs and wiki dumps.
        This is a python package with a Converter for epub and xml (wiki dumps) to text, lines, or Python generators.
        
        Usage:
        ------
        
        ### Epub usage
        
        #### Book by book
        
        To convert epubs to text files, usage is straightforward. First create a converter object:
        
        	converter = Converter("my_ebooks_folder/")
        
        Then using this converter let's concatenate all the text within the ebooks into a single mega text file:
        
        	converter.convert("my_succinct_text_file.gz")
        
        #### Line by line
        
        You can also proceed line by line:
        
        	from epub_conversion.utils import open_book, convert_epub_to_lines
        
        	book = open_book("twilight.epub")
        
        	lines = convert_epub_to_lines(book)
        
        ### Wikidump usage
        
        #### Redirections
        
        Suppose you are interested in all redirections in a given Wikipedia dump file
        that is still compressed, then you can access the dump as follows:
        
        
        	wiki = epub_conversion.wiki_decoder.almost_smart_open("enwiki.bz2")
        
        
        Taking this dump as our **input** let us now use a generator to output all pairs of `title` and `redirection title` in this dump:
        
        	redirections = {redirect_from:redirect_to
        		for redirect_from, redirect_to in epub_conversion.wiki_decoder.get_redirection_list(wiki)
        	}
        
        #### Page text
        
        Suppose you are interested in the lines within each page's text section only, then:
        
        
        	for line in epub_conversion.wiki_decoder.convert_wiki_to_lines(wiki):
        		process_line( line )
        
        
        See Also:
        ---------
        
        * [Wikipedia NER](https://github.com/JonathanRaiman/wikipedia_ner) a Python module that uses `epub_conversion` to process Wikipedia dumps and output only the lines that contain page to page links, with the link anchor texts extracted, and all markup removed.
Keywords: XML,epub,tokenization,NLP
Platform: any
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.3
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
