Metadata-Version: 2.1
Name: elmoformanylangs
Version: 0.0.4.post2
Summary: ELMo, updated to be usable with models for many languages
Home-page: https://github.com/HIT-SCIR/ELMoForManyLangs
Author: Research Center for Social Computing and Information Retrieval
License: UNKNOWN
Description: Pre-trained ELMo Representations for Many Languages
        ===================================================
        
        We release our ELMo representations trained on many languages,
        which helped us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)
        according to LAS.
        
        ## Technical Details
        
        We use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM
        and the character CNN.
        For each language, we train the parameters
        on 20 million words of data randomly
        sampled from the raw text released by the shared task (wikidump + common crawl).
        Our code is largely based on [AllenNLP](https://allennlp.org/), with the following changes:
        
        * We support Unicode characters;
        * We use the *sampled softmax* technique
        to make training on a large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).
        However, we use a window of words surrounding the target word
        as negative samples, which showed better performance in our preliminary experiments
        (a minimal sketch follows below).
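
        As a minimal sketch (not the authors' implementation) of this variant, the
        window-based negative sampling can be expressed as a softmax over a tiny
        candidate set; the function name and `window` size are illustrative:

        ```python
        import torch
        import torch.nn.functional as F

        # Instead of sampling negatives from the whole vocabulary, the words
        # surrounding the target position serve as negative samples, so the
        # softmax runs over a tiny candidate set.
        def window_sampled_softmax_loss(hidden, output_embedding, token_ids, pos, window=5):
            # hidden: (hidden_dim,) biLM context vector predicting position `pos`
            # output_embedding: (vocab_size, hidden_dim) softmax weight matrix
            # token_ids: (seq_len,) LongTensor of word ids for one sentence
            lo = max(0, pos - window)
            hi = min(len(token_ids), pos + window + 1)
            candidates = token_ids[lo:hi]                   # target word plus its neighbours
            logits = output_embedding[candidates] @ hidden  # scores over candidates only
            target = torch.tensor([pos - lo])               # index of the true word in the window
            return F.cross_entropy(logits.unsqueeze(0), target)
        ```

        Note the sketch ignores the edge case where the same word id repeats inside
        the window, in which case the duplicates act as extra copies of one candidate.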
        
        The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
        
        
        ## Downloads
        
        |   |   |   |   |
        |---|---|---|---|
        | [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |
        | [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |
        | [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |
        | [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |
        | [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |
        | [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |
        | [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |
        | [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |
        | [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |
        | [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |
        | [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |
        
        The models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).
        
        **ELMo for Simplified Chinese**
        
        We also provide a [simplified-Chinese ELMo](http://39.96.43.154/zhs.model.tar.bz2).
        It was trained on the Xinhua portion of [Chinese Gigaword (5th edition)](https://catalog.ldc.upenn.edu/ldc2011t13),
        which differs from the Wikipedia data used for the traditional-Chinese ELMo.
        
        ## Prerequisites
        
        * Python >= 3.6 is **required** (with Python 3.5 you will encounter [issue #8](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8))
        * PyTorch 0.4
        * other requirements from AllenNLP
        
        ## Usage
        
        
        ### Install the package
        
        You need to install the package before using the embeddings. Run the following command:
        ```
        python setup.py install
        ```
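
        Alternatively, since the package is published on PyPI, it can also be installed with pip:
        ```
        pip install elmoformanylangs
        ```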
        
        ### Set up the `config_path`
        After unzipping the model, you will find a JSON file `${lang}.model/config.json`.
        Change its `"config_path"` field to the relative path to
        the model configuration `cnn_50_100_512_4096_sample.json`.
        For example, if your ELMo model is `zht.model/config.json` and your model configuration
        is `zht.model/cnn_50_100_512_4096_sample.json`, set `"config_path"`
        in `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.
        
        If there is no configuration file `cnn_50_100_512_4096_sample.json` under `${lang}.model`,
        you can either copy `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,
        or set `"config_path"` to `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json`.
        
        See [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details. 
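
        If you prefer to patch the field from Python rather than by hand, a minimal
        sketch (reusing the `zht.model` example paths above) could look like:

        ```python
        import json

        # Point "config_path" in the model's config.json at the model
        # configuration file; the paths reuse the zht.model example.
        config_file = 'zht.model/config.json'

        with open(config_file, encoding='utf-8') as f:
            config = json.load(f)

        config['config_path'] = 'cnn_50_100_512_4096_sample.json'

        with open(config_file, 'w', encoding='utf-8') as f:
            json.dump(config, f, ensure_ascii=False, indent=2)
        ```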
        
        
        ### Use ELMoForManyLangs from the command line
        
        Prepare your input file in the [CoNLL-U format](http://universaldependencies.org/format.html), for example:
        ```
        1   Sue    Sue    _   _   _   _   _   _   _
        2   likes  like   _   _   _   _   _   _   _
        3   coffee coffee _   _   _   _   _   _   _
        4   and    and    _   _   _   _   _   _   _
        5   Bill   Bill   _   _   _   _   _   _   _
        6   tea    tea    _   _   _   _   _   _   _
        ```
        Fields should be separated by `'\t'`. We only use the second column, and a space (`' '`) is supported in
        this field (for Vietnamese, a word can contain spaces).
        Remember to tokenize your input!
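
        If your sentences are already tokenized in Python, a minimal sketch like the
        following (the file name is illustrative) can produce this 10-column layout:

        ```python
        # Write tokenized sentences in the 10-column layout shown above;
        # only the second column is actually read by the tool.
        sentences = [['Sue', 'likes', 'coffee', 'and', 'Bill', 'tea']]

        with open('input.conllu', 'w', encoding='utf-8') as f:
            for sent in sentences:
                for i, word in enumerate(sent, start=1):
                    # id, form, lemma, then seven placeholder columns
                    f.write('\t'.join([str(i), word, word] + ['_'] * 7) + '\n')
                f.write('\n')  # blank line between sentences
        ```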
        
        When it's all set, run
        
        ```
        $ python -m elmoformanylangs test \
            --input_format conll \
            --input /path/to/your/input \
            --model /path/to/your/model \
            --output_prefix /path/to/your/output \
            --output_format hdf5 \
            --output_layer -1
        ```
        
        It will dump an HDF5-encoded `dict` onto the disk, where each key is the `'\t'`-separated
        words of a sentence and the value is its 3-layer-averaged ELMo representation.
        You can also dump the CNN-encoded word representation with `--output_layer 0`,
        the first layer of the LSTM with `--output_layer 1`, and the second layer
        of the LSTM with `--output_layer 2`.
        We are actively changing the interface to make it better adapted to the
        AllenNLP ELMo and more programmer-friendly.
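
        As a quick check, you can inspect the dump with `h5py`; the exact output file
        name derived from `--output_prefix` is an assumption here:

        ```python
        import h5py

        # Per the description above, each key is the '\t'-joined words of a
        # sentence and each value is a (seq_len, embedding_size) matrix.
        with h5py.File('/path/to/your/output.hdf5', 'r') as f:
            for key in f:
                words = key.split('\t')
                vectors = f[key][()]  # load the dataset as a numpy array
                print(len(words), vectors.shape)
        ```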
        
        ### Use ELMoForManyLangs programmatically
        
        Thanks to @voidism for contributing the API.
        Using the `Embedder` Python object, you can use ELMo in your own code like this:
        
        ```python
        from elmoformanylangs import Embedder
        
        e = Embedder('/path/to/your/model/')
        
        sents = [['今', '天', '天氣', '真', '好', '阿'],
                 ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
        # a list of lists that stores the sentences,
        # segmented first if necessary

        e.sents2elmo(sents)
        # returns a list of numpy arrays,
        # each with shape (seq_len, embedding_size)
        ```
        
        #### The parameters to init `Embedder`:
        ```python
        Embedder(model_dir='/path/to/your/model/', batch_size=64)
        ```
        - **model_dir**: the path from the repo top directory to your model directory.
        - **batch_size**: the batch size used during inference; set it according to your GPU/CPU RAM size. (default: 64)
        
        #### The parameters of the function `sents2elmo`:
        ```python
        def sents2elmo(sents, output_layer=-1):
        ```
        - **sents**: a list of lists that stores the sentences, segmented first if necessary.
        - **output_layer**: the target layer to output.
            -  0 for the word encoder
            -  1 for the first LSTM hidden layer
            -  2 for the second LSTM hidden layer
            -  -1 for an average of the 3 layers (default)
            -  -2 for all 3 layers
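
        For example, to request all three layers at once (the per-sentence output
        shape below is inferred from the description and is an assumption):

        ```python
        # -2 returns all 3 layers, so each array is assumed to have shape
        # (3, seq_len, embedding_size)
        embeddings = e.sents2elmo(sents, output_layer=-2)
        print(embeddings[0].shape)
        ```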
        
        ## Training Your Own ELMo
        
        Please run 
        ```
        $ python -m elmoformanylangs.biLM train -h
        ```
        to get more details about the ELMo training. 
        
        Here is an example for training English ELMo.
        ```
        $ less data/en.raw
        ... (snip) ...
        Notable alumni
        Aris Kalafatis ( Acting )
        Labour Party
        They build an open nest in a tree hole , or man - made nest - boxes .
        Legacy
        ... (snip) ...
        
        $ python -m elmoformanylangs.biLM train \
            --train_path data/en.raw \
            --config_path elmoformanylangs/configs/cnn_50_100_512_4096_sample.json \
            --model output/en \
            --optimizer adam \
            --lr 0.001 \
            --lr_decay 0.8 \
            --max_epoch 10 \
            --max_sent_len 20 \
            --max_vocab_size 150000 \
            --min_count 3
        ```
        However, we
        need to add that the training process is not very stable.
        In some cases, the loss goes to `nan`. We are actively working on this and hope to
        improve it in the future.
        
        ## Citation
        
        If our ELMo gives you nice improvements, please cite our paper:
        
        ```
        @InProceedings{che-EtAl:2018:K18-2,
          author    = {Che, Wanxiang  and  Liu, Yijia  and  Wang, Yuxuan  and  Zheng, Bo  and  Liu, Ting},
          title     = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
          booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
          month     = {October},
          year      = {2018},
          address   = {Brussels, Belgium},
          publisher = {Association for Computational Linguistics},
          pages     = {55--64},
          url       = {http://www.aclweb.org/anthology/K18-2005}
        }
        ```
        
        Please also cite the
        [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home),
        which hosts the models:
        ```
        @InProceedings{fares-EtAl:2017:NoDaLiDa,
          author    = {Fares, Murhaf  and  Kutuzov, Andrey  and  Oepen, Stephan  and  Velldal, Erik},
          title     = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
          booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
          month     = {May},
          year      = {2017},
          address   = {Gothenburg, Sweden},
          publisher = {Association for Computational Linguistics},
          pages     = {271--276},
          url       = {http://www.aclweb.org/anthology/W17-0237}
        }
        ```
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
