lexnlp.extract.common.copyrights package¶
Submodules¶
lexnlp.extract.common.copyrights.copyright_en_style_parser module¶
Copyright extraction for English using NLTK and its pre-trained maximum entropy classifier.
This module implements basic copyright extraction for English text. It relies on pre-trained NLTK components, including the POS tagger and the (fuzzy) named-entity chunkers.
-
class
lexnlp.extract.common.copyrights.copyright_en_style_parser.CopyrightEnStyleParser¶
Bases: object
-
copyright_dates_re= regex.Regex('\\d{2,}', flags=regex.V0)¶
-
copyright_ptn= '((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))'¶
-
copyright_ptn_re= regex.Regex('((Copyright\\W\\s*|\\(\\s*[Cc]\\s*\\)\\s*|©)+\\s*(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)?\\s*(.+))', flags=regex.V0)¶
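To see what the groups of copyright_ptn capture, the pattern can be tried directly. This is a minimal sketch, not the parser's own code path: it compiles the pattern string shown above with the standard-library `re` module instead of `regex` (an assumption that works here because the pattern uses no `regex`-only syntax), and the helper name `split_notice` and the sample sentences are made up for illustration.

```python
import re

# Pattern string copied from copyright_ptn above, written as a raw string.
COPYRIGHT_PTN = (
    r"((Copyright\W\s*|\(\s*[Cc]\s*\)\s*|©)+"
    r"\s*(\d{4}(?:\s*[-,–]\s*\d{4})?)?\s*(.+))"
)
# Assumption: stdlib `re` stands in for the `regex` module used by the library.
copyright_re = re.compile(COPYRIGHT_PTN)


def split_notice(sentence):
    """Hypothetical helper: return (marker, years, remainder) or None."""
    m = copyright_re.search(sentence)
    if m is None:
        return None
    # group(2): the copyright marker, group(3): optional year or year range,
    # group(4): everything after the marker and years.
    return m.group(2), m.group(3), m.group(4)


print(split_notice("Copyright 1996 – 2019, Siemens"))
print(split_notice("(c) 2020 Acme Corp"))
print(split_notice("No notice here"))
```

When no year follows the marker, the third element is `None` and the remainder holds the rest of the sentence; turning that remainder into a company name is presumably the job of methods such as derive_company_name.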
-
classmethod
derive_company_name(ant: lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, phrase: str) → None¶
-
classmethod
extract_phrases_with_coords(sentence: str) → List[Tuple[str, int, int]]¶
-
static
get_copyright(text: str, return_sources=False) → Generator[[lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, None], None]¶
-
classmethod
get_copyright_annotations(text: str, return_sources=False) → Generator[[lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation, None], None]¶
Find copyright notices in text.
-
reg_company_name= regex.Regex('[\\p{Lu}]+[\\p{L}\\s]*', flags=regex.V0)¶
-
reg_valid_company_name= regex.Regex('\\p{L}[\\p{L}\\s,]+', flags=regex.V0)¶
-
classmethod
split_copyright_date(ant: lexnlp.extract.common.annotations.copyright_annotation.CopyrightAnnotation) → None¶
-
classmethod
take_best_company_name(names: List[str]) → str¶
-
year_ptn= '(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)'¶
-
year_ptn_re= regex.Regex('(\\d{4}(?:\\s*[-,–]\\s*\\d{4})?)$', flags=regex.V0)¶
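The difference between year_ptn and year_ptn_re is the trailing `$`: the compiled form only matches a year or year range at the very end of a string. The sketch below illustrates that, again using the standard-library `re` as a stand-in for `regex`. The helper name `trailing_years` and the sample strings are hypothetical, and how split_copyright_date actually combines these patterns is not documented on this page.

```python
import re

# Patterns copied from year_ptn_re and copyright_dates_re above.
# Assumption: stdlib `re` is used here in place of the `regex` module.
year_tail_re = re.compile(r"(\d{4}(?:\s*[-,–]\s*\d{4})?)$")
dates_re = re.compile(r"\d{2,}")


def trailing_years(text):
    """Hypothetical helper: years found at the end of `text`, as integers."""
    m = year_tail_re.search(text)
    if m is None:
        return []
    # Pull the individual digit runs out of the matched year or year range.
    return [int(d) for d in dates_re.findall(m.group(1))]


print(trailing_years("Siemens 1996 – 2019"))  # range at the end of the string
print(trailing_years("Siemens 2019"))         # single year at the end
print(trailing_years("1996 Siemens"))         # year not at the end: no match
```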
-
lexnlp.extract.common.copyrights.copyright_parser module¶
-
class
lexnlp.extract.common.copyrights.copyright_parser.CopyrightParser(parsing_functions: List[Callable[str, List[lexnlp.extract.common.pattern_found.PatternFound]]], split_params: lexnlp.utils.lines_processing.line_processor.LineSplitParams)¶
Bases: lexnlp.extract.common.text_pattern_collector.TextPatternCollector
-
get_annotations_as_dictionaries() → List[dict]¶
-
make_annotation_from_pattrn(locale: str, ptrn: lexnlp.extract.common.pattern_found.PatternFound, phrase: lexnlp.utils.lines_processing.line_processor.LineOrPhrase) → lexnlp.extract.common.annotations.text_annotation.TextAnnotation¶
-
lexnlp.extract.common.copyrights.copyright_parsing_methods module¶
-
class
lexnlp.extract.common.copyrights.copyright_parsing_methods.CopyrightParsingMethods¶
Bases: object
-
get_company_name_from_match(text: str, company_search_options: str, years: List[Tuple[int, int, int]]) → str¶
-
init_regexes()¶
-
init_trigger_words()¶
-
match_c_years_word(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]¶
- Parameters
phrase – example: Copyright 1996 – 2019, Siemens
- Returns
example: {name: ‘1996 – 2019, Siemens’, probability: 100, …}
-
match_word_c_years(phrase: str) → List[lexnlp.extract.common.pattern_found.PatternFound]¶
- Parameters
phrase – example: © Siemens 1996 – 2019
- Returns
example: {name: ‘© Siemens 1996 – 2019’, probability: 100, …}
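The two matchers above differ in the order they expect: marker, years, name versus marker, name, years. The following sketch shows one way to tell the two orderings apart. Everything in it is an assumption made for illustration: the regexes, the `classify` helper, and the returned labels are hypothetical and are not the regexes built by init_regexes, and the standard-library `re` is used instead of `regex`.

```python
import re

# Hypothetical building blocks, not the library's own regexes.
YEARS = r"\d{4}(?:\s*[-,–]\s*\d{4})?"
MARK = r"(?:Copyright\b|\(\s*[Cc]\s*\)|©)"

# Marker, then years, then name: "Copyright 1996 – 2019, Siemens"
mark_years_name = re.compile(MARK + r"\s*(" + YEARS + r")\s*,?\s*(\D.*)$")
# Marker, then name, then years: "© Siemens 1996 – 2019"
mark_name_years = re.compile(MARK + r"\s*(\D.*?)\s*(" + YEARS + r")$")


def classify(phrase):
    """Return (ordering, years, name), or None if neither ordering fits."""
    m = mark_years_name.search(phrase)
    if m:
        return "years-first", m.group(1), m.group(2)
    m = mark_name_years.search(phrase)
    if m:
        return "name-first", m.group(2), m.group(1)
    return None


print(classify("Copyright 1996 – 2019, Siemens"))
print(classify("© Siemens 1996 – 2019"))
print(classify("Siemens AG"))
```

Keeping one pattern per ordering, as the two documented methods do, makes each pattern simpler than a single expression that tries to cover both layouts.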
-
pre_process_found_matches(matches: List[lexnlp.extract.common.pattern_found.PatternFound], company_search_options: str) → List[lexnlp.extract.common.copyrights.copyright_pattern_found.CopyrightPatternFound]¶
-
lexnlp.extract.common.copyrights.copyright_pattern_found module¶
-
class
lexnlp.extract.common.copyrights.copyright_pattern_found.CopyrightPatternFound(ptrn: lexnlp.extract.common.pattern_found.PatternFound = None)¶
Bases: lexnlp.extract.common.pattern_found.PatternFound
-
get_detalization_level(text: str) → int¶
-
get_length() → int¶
-
pattern_worse_than_target(p, text: str) → bool¶
Check which pattern is better when two patterns are considered duplicates. The “text” argument may be used in derived classes.
-
reg_uppercase= regex.Regex('[\\p{Lu}]+', flags=regex.V0)¶
-