Easy-Tokenizer
Description
Most tokenizers are either too cumbersome (neural-network based) or too simple. This simple rule-based tokenizer is small and sufficiently good. In particular, it handles long strings that some simple tokenizers often parse incorrectly, and it deals with URLs, email addresses, and long digit sequences rather well.
Try it from the command line:
easy_tokenizer -s input_text
or
easy_tokenizer -f input_file
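To illustrate what "rule based" can mean here, the following is a minimal sketch of a regex-driven tokenizer that keeps URLs, email addresses, and long digit runs together as single tokens. It is not the package's implementation: the name `sketch_tokenize` and every pattern below are assumptions made up for illustration, and the real rules may differ.

```python
import re

# Hypothetical patterns for illustration only; NOT easy-tokenizer's actual rules.
# Alternatives are tried left to right, so the more specific rules come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s]+[^\s.,;:!?)]      # URLs, leaving trailing punctuation outside
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email addresses
  | \d+(?:[.,]\d+)*                  # digit runs such as 1,000,000 or 3.14
  | \w+                              # ordinary words
  | [^\w\s]                          # any other single symbol
    """,
    re.VERBOSE,
)


def sketch_tokenize(text):
    """Return the list of tokens found in text by the illustrative rules above."""
    return TOKEN_PATTERN.findall(text)


# Joining with spaces mirrors the space-separated output format of the CLI.
print(" ".join(sketch_tokenize("this is a simple test.")))
```

The ordering of the alternatives is the main design choice: a plain word rule placed first would split `john.doe@example.com` at the dot before the email rule ever had a chance to match.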
Requirements
Python 3.6+
Installation
pip install easy-tokenizer
Usage
easy-tokenizer:
input:
- string (-s): input string to tokenize
- filename (-f): input text file to tokenize
- output (-o): output filename, optional; prints to STDOUT when not set
output:
- a sequence of space separated tokens
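examples:
# string input
easy-tokenizer -s "this is a simple test."
# file input, printing to STDOUT
easy-tokenizer -f foo.txt
# file input, writing the tokens to an output file
easy-tokenizer -f foo.txt -o bar.txt
For the string input above, the output will be “this is a simple test .”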
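Because the output is a sequence of space-separated tokens, it is easy to consume from Python. The sketch below assumes the package is installed and that the executable is on the PATH under the name used in the examples above; the helper names `parse_tokens` and `tokenize_with_cli` are hypothetical and not part of the package.

```python
import subprocess


def parse_tokens(output):
    """Split the space-separated CLI output into a list of tokens."""
    return output.split()


def tokenize_with_cli(text, command="easy-tokenizer"):
    """Run `<command> -s TEXT` and return its tokens.

    Assumes the command-line tool is installed; the executable name is
    taken from the examples and can be overridden via `command`.
    """
    result = subprocess.run(
        [command, "-s", text],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout as text (works on Python 3.6)
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return parse_tokens(result.stdout)
```

With the tool installed, `tokenize_with_cli("this is a simple test.")` would be expected to return the tokens of the output line shown above.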
Development
To install the package and its dependencies, run the following from the project root directory:
python setup.py install
To work on the code and develop the package, run the following from the project root directory:
python setup.py develop
To run unit tests, execute the following from the project root directory:
python setup.py test