Metadata-Version: 2.1
Name: url-metadata
Version: 0.1.6
Summary: A cache which saves URL metadata and summarizes content
Home-page: https://github.com/seanbreckenridge/url_metadata
Author: Sean Breckenridge
Author-email: seanbrecke@gmail.com
License: http://www.apache.org/licenses/LICENSE-2.0
Description: [![PyPi version](https://img.shields.io/pypi/v/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![Python3.7|Python 3.8](https://img.shields.io/pypi/pyversions/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
        
        This is not perfect and still in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data to be useful, but not so much that it takes enormous amounts of space.
        
        Current TODOs:
        
        - [ ] Improve CLI interface to match all functions
        - [ ] Improve HTML/text parsing (see [#6](https://github.com/seanbreckenridge/url_metadata/issues/6))
        - [ ] Add more sites using the [abstract interface](https://github.com/seanbreckenridge/url_metadata/blob/master/src/url_metadata/sites/abstract.py), to get more info from sites I use commonly
        - [ ] Add a preprocessing step to the sites abstract interface/URLMetadataCache functions, which 'corrects' URLs, to avoid hash mismatches
        
        A cache which saves URL metadata and summarizes content
        
        This is meant to provide more context to any of my tools which use URLs. If I [watched some youtube video](https://github.com/seanbreckenridge/mpv-sockets/blob/master/DAEMON.md) and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I [read an article](https://github.com/seanbreckenridge/ffexport), I want the article text! This requests, parses, and caches that data for me locally, so I can just do:
        
        ```python
        >>> from url_metadata import metadata
        >>> m = metadata("https://pypi.org/project/beautifulsoup4/")
        >>> len(m.info["images"])
        46
        >>> m.info["title"]
        'beautifulsoup4'
        >>> m.text_summary[:57]
        "Beautiful Soup is a library that makes it easy to scrape"
        ```
        
        If I ever request the same URL again, that info is grabbed from a local directory cache instead.
        
        ---
        
        ## Installation
        
        Requires `python3.7+`
        
        To install with pip, run:
        
            pip install url_metadata
        
        ---
        
        This uses:
        
        - [`lassie`](https://github.com/michaelhelmick/lassie) to get generic metadata; the title, description, opengraph information, links to images/videos on the page
        - [`readability`](https://github.com/buriy/python-readability) to convert HTML to a summary of the HTML content.
        - [`bs4`](https://pypi.org/project/beautifulsoup4/) to convert the parsed HTML to text (to allow for nicer text searching)
        - [`youtube_subtitles_downloader`](https://github.com/seanbreckenridge/youtube_subtitles_downloader) to get manual/autogenerated captions (converted to a `.srt` file) from Youtube URLs.
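
        The `readability`/`bs4` steps boil down to stripping markup from the article HTML to get searchable plaintext. As a rough illustration of that text-extraction step only, here is a stdlib-only sketch using `html.parser` (the real pipeline uses `readability` and `bs4`, so its actual output will differ):

        ```python
        from html.parser import HTMLParser

        class TextExtractor(HTMLParser):
            """Collects the text content of an HTML document, ignoring tags."""

            def __init__(self):
                super().__init__()
                self.chunks = []

            def handle_data(self, data):
                # called for each run of text between tags
                if data.strip():
                    self.chunks.append(data.strip())

            def text(self):
                return " ".join(self.chunks)

        parser = TextExtractor()
        parser.feed("<article><h1>Arguments</h1><p>Arguments work similarly to options.</p></article>")
        print(parser.text())  # Arguments Arguments work similarly to options.
        ```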
        
        ---
        
        ### Usage:
        
        The CLI interface provides some utility commands to get/list information from the cache.
        
        ```
        $ url_metadata --help
        Usage: url_metadata [OPTIONS] COMMAND [ARGS]...
        
        Options:
          --cache-dir PATH          Override default cache directory location
          --debug / --no-debug      Increase log verbosity
          --sleep-time INTEGER      How long to sleep between requests
          --skip-subtitles          Don't attempt to download subtitles
          --subtitle-language TEXT  Subtitle language for Youtube captions
          --help                    Show this message and exit.
        
        Commands:
          cachedir  Prints the location of the local cache directory
          export    Print all cached information as JSON
          get       Get information for one or more URLs Prints results as JSON
          list      List all cached URLs
        ```
        
        ---
        
        In Python, this can be configured by using the `url_metadata.URLMetadataCache` class:
        
        ```python
        url_metadata.URLMetadataCache(loglevel: int = 30,
                                      subtitle_language: str = 'en',
                                      sleep_time: int = 5,
                                      skip_subtitles: bool = False,
                                      cache_dir: Optional[Union[str, pathlib.Path]] = None)
            """
            Main interface to the library
        
            subtitle_language: for youtube subtitle requests
            sleep_time: time to wait between HTTP requests
            skip_subtitles: don't attempt to download youtube subtitles
            cache_dir: location to store cached data
                       uses default user cache directory if not provided
            """
        
        get(self, url: str) -> url_metadata.model.Metadata
            """
            Gets metadata/summary for a URL
            Saves the parsed information in a local data directory
            If the URL already has cached data locally, returns that instead
            """
        
        get_cache_dir(self, url: str) -> Optional[str]
            """
            If this URL is in cache, returns the location of the cache directory
            Returns None if it couldn't find a matching directory
            """
        
        in_cache(self, url: str) -> bool
            """
            Returns True if the URL already has cached information
            """
        
        request_data(self, url: str) -> url_metadata.model.Metadata
            """
            Given a URL:
        
            If this is a youtube URL, this requests youtube subtitles
            Uses lassie to grab metadata
            Parses the HTML text with readability
            Uses bs4 to convert that text into a plaintext summary
            """
        ```
        
        For example:
        
        ```python
        import logging
        from url_metadata import URLMetadataCache
        
        # make requests every 2 seconds
        # debug logs
        # save to a folder in my home directory
        cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
        c = cache.get("https://github.com/seanbreckenridge/url_metadata")
        # just request information, don't read/save to cache
        data = cache.request_data("https://www.wikipedia.org/")
        ```
        
        ### CLI Examples
        
        The `get` command emits `JSON`, so it can be combined with other tools (e.g. [`jq`](https://stedolan.github.io/jq/)), like:
        
        ```shell
        $ url_metadata get "https://click.palletsprojects.com/en/7.x/arguments/" \
            | jq -r '.[] | .text_summary' | head -n5
        Arguments
        Arguments work similarly to options but are positional.
        They also only support a subset of the features of options due to their
        syntactical nature. Click will also not attempt to document arguments for
        you and wants you to document them manually
        ```
        
        ```shell
        $ url_metadata export | jq -r '.[] | .info | .title'
        seanbreckenridge/youtube_subtitles_downloader
        Arguments — Click Documentation (7.x)
        ```
        
        ```shell
        $ url_metadata list --location
        /home/sean/.local/share/url_metadata/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
        /home/sean/.local/share/url_metadata/data/9/4/4/1c380792a3d62302e1137850d177b/000
        ```
        
        ```shell
        # to make a backup of the cache directory
        $ tar -cvzf url_metadata.tar.gz "$(url_metadata cachedir)"
        ```
        
        Accessible through the `url_metadata` script and `python3 -m url_metadata`.
        
        ### Implementation Notes
        
        This stores the information as individual files in a cache directory (using [`appdirs`](https://github.com/ActiveState/appdirs)). In particular, it `MD5`-hashes the URL and stores information like:
        
        ```
        .
        └── 7
            └── b
                └── 0
                    └── d952fd7265e8e4cf10b351b6f8932
                        └── 000
                            ├── epoch_timestamp.txt
                            ├── key
                            ├── metadata.json
                            ├── subtitles.srt
                            ├── summary.html
                            └── summary.txt
        ```
        
        You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index, instead it hashes the URL and searches for a matching `key` file. See the comments [here](https://github.com/seanbreckenridge/url_metadata/blob/master/src/url_metadata/cache.py) for implementation details.
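
        For illustration, the hashed layout shown above (three single-character shard directories, the rest of the hash, then a numeric index) could be sketched like this; this is a reconstruction from the directory tree, not the library's actual code:

        ```python
        import hashlib
        from pathlib import Path

        def cache_path(url: str, base: Path) -> Path:
            """Illustrative only: MD5-hash the URL, use the first three hex
            characters as nested shard directories, and append a '000' index
            (mirroring the tree shown above)."""
            h = hashlib.md5(url.encode()).hexdigest()
            return base / h[0] / h[1] / h[2] / h[3:] / "000"

        print(cache_path("https://www.wikipedia.org/", Path("data")))
        ```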
        
        By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.
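
        That request-once, read-from-cache-afterwards pattern can be sketched without the library at all; here a plain dict stands in for the on-disk cache and `fetch()` stands in for the real HTTP request:

        ```python
        requests_made = 0  # counts simulated HTTP requests

        def fetch(url: str) -> str:
            """Stand-in for the real network request."""
            global requests_made
            requests_made += 1
            return f"metadata for {url}"

        cache = {}  # stand-in for the on-disk cache directory

        def get(url: str) -> str:
            if url not in cache:  # only hit the network on a cache miss
                cache[url] = fetch(url)
            return cache[url]

        urls = ["https://www.wikipedia.org/", "https://pypi.org/"]
        for u in urls:  # first pass: one request per URL
            get(u)
        for u in urls:  # second pass: served entirely from the cache
            get(u)
        print(requests_made)  # 2
        ```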
        
        Originally created for [`HPI`](https://github.com/seanbreckenridge/HPI).
        
        ---
        
        ### Testing
        
            git clone 'https://github.com/seanbreckenridge/url_metadata'
            cd ./url_metadata
            git submodule update --init
            pip install '.[testing]'
            mypy ./src/url_metadata/
            pytest
        
Keywords: url cache metadata youtube subtitles
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
Provides-Extra: testing
