CLI.utils.sentence_tokenizer¶
About:¶
The sentence_tokenizer command splits text into sentences using the SinaTools utility. Flags select which boundaries to split at: dots, question marks, exclamation marks, and new lines.
Usage:¶
Below is the usage information that can be generated by running sentence_tokenizer --help.
Usage:
sentence_tokenizer --text=TEXT [options]
sentence_tokenizer --file=FILE [options]
Options:
--text TEXT
Text to be tokenized into sentences.
--file FILE
File containing the text to be tokenized into sentences.
--dot
Tokenize at dots.
--new_line
Tokenize at new lines.
--question_mark
Tokenize at question marks.
--exclamation_mark
Tokenize at exclamation marks.
Examples:¶
sentence_tokenizer --text "Your text here. Does it work? Yes! Try with new lines." --dot --question_mark --exclamation_mark
sentence_tokenizer --file "path/to/your/file.txt" --dot --question_mark --exclamation_mark
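To make the effect of the flags concrete, here is a minimal Python sketch of flag-controlled sentence splitting. It is not the SinaTools implementation: the function name split_sentences and its keyword arguments are hypothetical and only mirror the CLI flags, and the logic is a plain character scan that does not handle abbreviations, decimals, or ellipses.

```python
def split_sentences(text, dot=False, new_line=False,
                    question_mark=False, exclamation_mark=False):
    """Split text after each enabled boundary character (illustrative only)."""
    # Collect the boundary characters selected by the (hypothetical) flags.
    boundaries = set()
    if dot:
        boundaries.add(".")
    if question_mark:
        boundaries.add("?")
    if exclamation_mark:
        boundaries.add("!")
    if new_line:
        boundaries.add("\n")

    sentences, current = [], []
    for ch in text:
        if ch in boundaries:
            # Keep the punctuation with its sentence, but drop the newline itself.
            if ch != "\n":
                current.append(ch)
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
        else:
            current.append(ch)

    # Whatever follows the last boundary is emitted as a final sentence.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this sketch, enabling only dot on the first example text would leave "Does it work? Yes! Try with new lines." as a single sentence, while enabling all three punctuation flags yields four sentences.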
CLI.utils.corpus_tokenizer¶
About:¶
The corpus_tokenizer command tokenizes a corpus and writes the results to a CSV file. It recursively searches a specified directory for text files, tokenizes their content, and writes one row per word token, together with identifying metadata, to the specified CSV file.
Usage:¶
Below is the usage information that can be generated by running corpus_tokenizer --help.
Usage:
corpus_tokenizer --dir_path=DIR_PATH --output_csv=OUTPUT_CSV
Options:
--dir_path DIR_PATH
The path to the directory containing the text files.
--output_csv OUTPUT_CSV
The path to the output CSV file.
Examples:¶
corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv "outputFile.csv"
- The tool only processes text files (with a .txt extension).
- The output CSV will contain the following columns:
  - ‘Row_ID’: a unique identifier for each word token.
  - ‘Sentence ID’: the identifier for the sentence in which the word appears.
  - ‘Docs_Sentence_Word_ID’: a concatenated identifier comprising directory name, file name, sentence id, and word position.
  - ‘Word Position’: the position of the word within the sentence.
- Ensure that the text files are appropriately encoded in UTF-8 or compatible formats.
- The tool uses the nltk library for sentence and word tokenization. Make sure to have the library installed in your environment.
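The workflow described above (walk a directory, keep only .txt files, tokenize, write one CSV row per word) can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name tokenize_corpus is hypothetical, a regular expression and str.split stand in for nltk so the sketch has no dependencies, the ‘Word’ column and the underscore separator in ‘Docs_Sentence_Word_ID’ are assumptions, and sentence ids are assumed to run continuously across files.

```python
import csv
import os
import re

# 'Word' is an assumed extra column; the other names come from the notes above.
FIELDNAMES = ["Row_ID", "Docs_Sentence_Word_ID", "Sentence ID", "Word Position", "Word"]


def tokenize_corpus(dir_path, output_csv):
    """Write one CSV row per word found in the .txt files under dir_path."""
    row_id = 0
    sentence_id = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
        writer.writeheader()
        for root, dirs, files in os.walk(dir_path):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                if not name.endswith(".txt"):
                    continue  # only .txt files are processed
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    text = f.read()
                dir_name = os.path.basename(os.path.normpath(root))
                # Stand-in splitter: break after . ! ? followed by whitespace, or at newlines.
                for sentence in re.split(r"(?<=[.!?])\s+|\n+", text):
                    words = sentence.split()  # stand-in for word tokenization
                    if not words:
                        continue
                    sentence_id += 1
                    for position, word in enumerate(words, start=1):
                        row_id += 1
                        writer.writerow({
                            "Row_ID": row_id,
                            # Assumed format: directory_file_sentence_position
                            "Docs_Sentence_Word_ID": f"{dir_name}_{name}_{sentence_id}_{position}",
                            "Sentence ID": sentence_id,
                            "Word Position": position,
                            "Word": word,
                        })
    return row_id  # number of word rows written
```

Opening both the input files and the output CSV with encoding="utf-8" matches the encoding note above; files in other encodings would need to be converted first.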