lexnlp.utils package
Subpackages
- lexnlp.utils.lines_processing package
- lexnlp.utils.tests package
- Submodules
- lexnlp.utils.tests.test_line_processor module
- lexnlp.utils.tests.test_map module
- lexnlp.utils.tests.test_parse_df module
- lexnlp.utils.tests.test_parsed_text_corrector module
- lexnlp.utils.tests.test_parsed_text_quality_estimator module
- lexnlp.utils.tests.test_phrase_finder module
- Module contents
- lexnlp.utils.unicode package
Submodules
lexnlp.utils.decorators module
lexnlp.utils.decorators.safe_failure(func)
Decorator: return None if the wrapped function fails, or skip the result if the wrapped function is a generator.
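The behavior described for safe_failure can be sketched as a plain decorator. This is a minimal sketch under stated assumptions, not the library's implementation: the name `safe_failure_sketch` is hypothetical, and both catching every `Exception` and ending a generator at its first failure are guesses at what "skip result" means.

```python
import functools
import inspect


def safe_failure_sketch(func):
    """Hypothetical stand-in for a 'return None on failure' decorator."""
    if inspect.isgeneratorfunction(func):
        @functools.wraps(func)
        def generator_wrapper(*args, **kwargs):
            try:
                yield from func(*args, **kwargs)
            except Exception:
                # Assumption: a failing generator simply stops yielding.
                return
        return generator_wrapper

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Assumption: any exception is swallowed and None is returned.
            return None
    return wrapper


@safe_failure_sketch
def parse_int(value):
    return int(value)


@safe_failure_sketch
def parse_ints(values):
    for value in values:
        yield int(value)
```

With this sketch, `parse_int("x")` yields None instead of raising, and `parse_ints` yields only the values produced before the first failure.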
lexnlp.utils.iterating_helpers module
lexnlp.utils.iterating_helpers.collapse_sequence(sequence: collections.abc.Iterable, predicate: Callable[[Any, Any], Any], accumulator: Any = 0.0) → Any
lexnlp.utils.iterating_helpers.count_sequence_matches(sequence: collections.abc.Iterable, predicate: Callable[[Any], bool]) → int
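The two signatures above suggest a fold and a conditional count. The following is a minimal sketch of helpers with those shapes, not the library's code: the `_sketch` names are hypothetical, and the argument order `predicate(item, accumulator)` is an assumption the signature does not confirm.

```python
def collapse_sequence_sketch(sequence, predicate, accumulator=0.0):
    """Fold a sequence into one value (hypothetical stand-in)."""
    for item in sequence:
        # Assumption: the predicate receives (item, accumulator)
        # and returns the updated accumulator.
        accumulator = predicate(item, accumulator)
    return accumulator


def count_sequence_matches_sketch(sequence, predicate):
    """Count the items for which predicate(item) is truthy."""
    return sum(1 for item in sequence if predicate(item))


words = ["ab", "cde", ""]
# Total number of characters across all words.
total_length = collapse_sequence_sketch(
    words, lambda item, acc: acc + len(item), 0)
# Number of non-empty words.
non_empty = count_sequence_matches_sketch(words, bool)
```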
lexnlp.utils.map module
lexnlp.utils.parse_df module
class lexnlp.utils.parse_df.DataframeEntityParser(dataframe, parse_columns, result_columns=None, preformed_entity=None, priority_sort_column=None, priority_sort_ascending=True, cell_values_separator=';', unique_column_values=True, line_processor: lexnlp.utils.lines_processing.line_processor.LineProcessor = None)
Bases: object
Extracts entities from a text, given a collection of entities stored as a dataframe. By default, the dataframe is assumed to have UNIQUE values in the columns used for the search. Returns a dict with the start/end positions of each found item in the text, plus other user-defined key-value pairs.
- Params:
dataframe: pandas.DataFrame with entities collection
parse_columns: list or tuple - these columns will be used to search their values in a text
result_columns: dict - map like {‘dataframe column name to take a value corresponding with extracted entity’: ‘new_column_name’}
preformed_entity: dict - initial, static key-value pairs to use for each extracted entity
priority_sort_column: str - column name to sort by when multiple rows match, so that the first row after sorting is taken; if not set, the first matched row is used
priority_sort_ascending: bool - sort order for priority_sort_column
cell_values_separator: str or None - separator between multiple values stored in a single dataframe cell
unique_column_values: bool - dataframe columns have unique values
- E.g.:
>>> parse_columns = ('Kurztitel', 'Titel', 'Abkürzung')
>>> result_columns = {'Titel': 'name'}
>>> preformed_entity = {'entity_type': 'Laws and Rules',
...                     'source': 'BaFin',
...                     'country': 'Germany'}
>>> sort_column = 'Titel'
>>> items = DataframeEntityParser(
...     df, parse_columns, result_columns, preformed_entity, sort_column).parse(text)
SEARCH_PTN = '(?:^|\\W)({})(?:\\W|$)'
get_collection_ptn(collection)
Convert a list of values to a regex pattern.
:param collection: list of entities to search in
:return: compiled regex pattern
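How a list of values might be turned into one pattern via SEARCH_PTN can be sketched as follows. This is an illustration under stated assumptions, not the method's actual body: `build_collection_pattern` is a hypothetical name, and escaping each value and sorting longest-first are choices made for this sketch.

```python
import re

# Same template as SEARCH_PTN above: the value must be delimited by
# a non-word character or by the start/end of the text.
SEARCH_PTN = r'(?:^|\W)({})(?:\W|$)'


def build_collection_pattern(collection):
    """Hypothetical helper: compile one regex matching any listed value."""
    # Assumption: longer values go first so that a short value does not
    # shadow a longer one that starts with the same characters.
    ordered = sorted(set(collection), key=lambda v: (-len(v), v))
    # Escape each value so regex metacharacters are matched literally.
    alternatives = '|'.join(re.escape(value) for value in ordered)
    return re.compile(SEARCH_PTN.format(alternatives))


pattern = build_collection_pattern(['KWG', 'WpHG'])
match = pattern.search('See the KWG and more.')
# Group 1 holds the value itself, without the surrounding delimiters.
found = (match.group(1), match.start(1), match.end(1))
```

Because the delimiters sit in non-capturing groups, the positions of group 1 cover only the matched value.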
get_entities(text: str)
get_entities_from_text(text: str) → Generator[dict, None, None]
get_entity_list(text)
get_formed_entity(match, col_name)
Get the formed entity from the matched row in the dataframe.
:param match: re.match object
:param col_name: df column name
:return: dict
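The way the static preformed_entity pairs, the result_columns mapping, and the match positions could combine into one dict can be sketched as below. Everything here is illustrative rather than the library's behavior: `form_entity_sketch` is a hypothetical function with its own signature, the row is a plain dict standing in for a dataframe row, the sample row is made up, and the key names `start` and `end` are assumptions.

```python
import re


def form_entity_sketch(match, row, result_columns=None, preformed_entity=None):
    """Hypothetical sketch: build one entity dict from a regex match."""
    # Start from the static key-value pairs shared by every entity.
    entity = dict(preformed_entity or {})
    # Assumption: positions are stored under 'start' and 'end'.
    entity['start'] = match.start(1)
    entity['end'] = match.end(1)
    # Copy selected row values under their new names.
    for source_column, new_name in (result_columns or {}).items():
        entity[new_name] = row[source_column]
    return entity


# Made-up sample row; a plain dict stands in for a dataframe row.
row = {'Kurztitel': 'KWG', 'Titel': 'Kreditwesengesetz'}
match = re.search(r'(?:^|\W)(KWG)(?:\W|$)', 'See the KWG and more.')
entity = form_entity_sketch(
    match, row, {'Titel': 'name'}, {'country': 'Germany'})
```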
get_single_result(rows)
By default, all values used for filtering in the dataframe are assumed to be UNIQUE, so the first row is taken. Implement your own logic to choose from multiple matched dataframe rows.
lexnlp.utils.parse_df.get_entities(text: str, config: pandas.core.frame.DataFrame, parse_columns: Union[List[str], Tuple[str]], result_columns: Optional[dict] = None, preformed_entity: Optional[dict] = None, priority_sort_column: Optional[str] = None, priority_sort_ascending: bool = True, cell_values_separator: Optional[str] = ';', unique_column_values: bool = True) → Generator
Simple wrapper around DataframeEntityParser.
lexnlp.utils.parse_df.get_entity_list(text: str, config: pandas.core.frame.DataFrame, parse_columns: Union[List[str], Tuple[str]], result_columns: Optional[dict] = None, preformed_entity: Optional[dict] = None, priority_sort_column: Optional[str] = None, priority_sort_ascending: bool = True, cell_values_separator: Optional[str] = ';', unique_column_values: bool = True) → List
Simple wrapper around DataframeEntityParser.