easy_tokenizer package¶
Submodules¶
easy_tokenizer.patterns module¶
class to store knowledge used for tokenization
-
class
easy_tokenizer.patterns.
Patterns
¶ Bases:
object
Contains a set of special chars that could be used for tokenization
-
ABBREV_RE
= re.compile('(\\w\\.){2,}|(?:jan|feb|mar|apr|jun|jul|aug|sep|Sept|sept|SEPT|oct|nov|dec)\\.')¶
-
ALL_WEB_CAPTURED_RE
= re.compile('((?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?|\\S+)¶
-
ALL_WEB_RE
= re.compile('(?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?|\\S+)¶
-
COMMON_HYPHEN_START
= ['e', 'i', 're', 'ex', 'self', 'fore', 'all', 'low', 'high']¶
-
DIGITS_CAPTURED_RE
= re.compile('((?:\\b|^)[-+±~]?(?:\\d[-.,0-9\\/#]*\\d|\\d+(?:st|nd|rd|th|[dD])?)[%]?(?:\\b|$))')¶
-
DIGITS_RE
= re.compile('(?:\\b|^)[-+±~]?(?:\\d[-.,0-9\\/#]*\\d|\\d+(?:st|nd|rd|th|[dD])?)[%]?(?:\\b|$)')¶
-
DIGIT_RE
= re.compile('\\d')¶
-
DOMAIN_RE
= re.compile('[@]\\S+[.]\\S+')¶
-
EMAIL_RE
= re.compile('\\S+[@]\\S+[.]\\S+')¶
-
HYPHEN_CAPTURED_RE
= re.compile('([\\-\\–\\—])')¶
-
HYPHEN_RE
= re.compile('[\\-\\–\\—]')¶
-
PARA_SEP_RE
= re.compile('(\\W|\\+\\-)\\1{4,}')¶
-
PUNCT_END_PHRASE
= frozenset({'’', ',', '"', ']', '»', '“', '”', ';', '?', ')', '[…]', ':', '!', '.', "'"})¶
-
PUNCT_SEQ_RE
= re.compile("[-!\\'#%&`()\\[\\]*+,.\\\\/:;<=>?@^$_{|}~]+")¶
-
URL_RE
= re.compile('(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?|(?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')¶
-
WORD_BF_CAPTURED_RE
= re.compile('([()\\[\\]{}"“”\\\'`»:;,/\\\\*?!…<=>@^$\\|~%]|[\\u2022\\u2751\\uF000\\uF0FF]|[\\u25A0-\\u25FF]|\\.{2,})')¶
-
YEAR_RE
= re.compile('(?:\\b|^)(?:19|20)\\d\\d(?:\\b|$)')¶
-
static
abbreviation
(phrase)¶
-
all_web_captured_pn
= '((?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?|\\S+[@]\\S+[.]\\S+|[@]\\S+[.]\\S+)'¶
-
all_web_pn
= '(?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?|\\S+[@]\\S+[.]\\S+|[@]\\S+[.]\\S+'¶
-
digits_captured_pn
= '((?:\\b|^)[-+±~]?(?:\\d[-.,0-9\\/#]*\\d|\\d+(?:st|nd|rd|th|[dD])?)[%]?(?:\\b|$))'¶
-
digits_pn
= '(?:\\b|^)[-+±~]?(?:\\d[-.,0-9\\/#]*\\d|\\d+(?:st|nd|rd|th|[dD])?)[%]?(?:\\b|$)'¶
-
domain_pn
= '[@]\\S+[.]\\S+'¶
-
email_pn
= '\\S+[@]\\S+[.]\\S+'¶
-
hyphen_pn
= '[\\-\\–\\—]'¶
-
known_month_pn
= '(?:jan|feb|mar|apr|jun|jul|aug|sep|Sept|sept|SEPT|oct|nov|dec)\\.'¶
-
months
= ['jan', 'feb', 'mar', 'apr', 'jun', 'jul', 'aug', 'sep', 'Sept', 'sept', 'SEPT', 'oct', 'nov', 'dec']¶
-
repeat_abbrev_pn
= '(\\w\\.){2,}'¶
-
si_units
= ['m²', 'fm', 'cm²', 'm³', 'cm³', 'l', 'ltr', 'dl', 'cl', 'ml', '°C', '°F', 'K', 'g', 'gr', 'kg', 't', 'mg', 'μg', 'm', 'km', 'mm', 'μm', 'cm', 'sm', 's', 'ms', 'μs', 'Nm', 'klst', 'min', 'W', 'mW', 'kW', 'MW', 'GW', 'TW', 'J', 'kJ', 'MJ', 'GJ', 'TJ', 'kWh', 'MWh', 'kWst', 'MWst', 'kcal', 'cal', 'N', 'kN', 'V', 'v', 'mV', 'kV', 'A', 'mA', 'Hz', 'kHz', 'MHz', 'GHz', 'Pa', 'hPa', '°', '°c', '°f']¶
-
url_pn
= '(?:[0-9a-zA-Z][-\\w_]+)(?:\\.[0-9a-zA-Z][-\\w_]+){2,5}(?:(?:\\/(?:[0-9a-zA-Z]|[-_?.#=:&%])+)+)?\\/?'¶
-
url_strict_pn
= '(?:(?:http[s]?|ftp)://|wwww?[.])(?:[a-zA-Z]|[0-9]|[-_:\\/?@.&+=]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'¶
-
word_bf_captured_pn
= '([()\\[\\]{}"“”\\\'`»:;,/\\\\*?!…<=>@^$\\|~%]|[\\u2022\\u2751\\uF000\\uF0FF]|[\\u25A0-\\u25FF]|\\.{2,})'¶
-
word_bf_pn
= '[()\\[\\]{}"“”\\\'`»:;,/\\\\*?!…<=>@^$\\|~%]|[\\u2022\\u2751\\uF000\\uF0FF]|[\\u25A0-\\u25FF]|\\.{2,}'¶
-
easy_tokenizer.token_with_pos module¶
Class for tokens with position information
-
class
easy_tokenizer.token_with_pos.
TokenWithPos
(text, start, end)¶ Bases:
object
- TokenWithPos: token with start and end position in the text
- attributes: - text: text in the normalized form - start: start position - end: end position
easy_tokenizer.tokenizer module¶
Tokenizer Class
-
class
easy_tokenizer.tokenizer.
Tokenizer
(regexp=None)¶ Bases:
object
A basic Tokenizer class to tokenize strings and patterns
- Parameters:
- regexp: regexp used to tokenize the string
-
tokenize
(text)¶ - params:
- text: string
- pos_info: also output the position information when tokenizing
output: tokens (with position info)
-
tokenize_with_pos_info
(text)¶ tokenize
- params:
- text: string
- output:
- a list of Token object