CLI.utils.corpus_tokenizer
About:
The sina_corpus_tokenizer command tokenizes a corpus and writes the results to a CSV file. It recursively searches a specified directory for text files, splits their content into sentences and words, and writes each word token, together with identifying metadata such as its sentence ID and word position, to the specified CSV file.
Usage:
Below is the usage information that can be generated by running sina_corpus_tokenizer --help.
Usage:
sina_corpus_tokenizer --dir_path DIR_PATH --output_csv OUTPUT_CSV
Arguments:
--dir_path
The path to the directory containing the text files.
--output_csv
The path to the output CSV file.
Examples:
sina_corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv "outputFile.csv"
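To call the command from a Python script rather than a shell, one option is to build the argument list and pass it to subprocess.run. The sketch below assumes that sina_corpus_tokenizer is installed and on PATH, and that it accepts the --dir_path and --output_csv flags exactly as in the example above. The helper names build_command and run_tokenizer are hypothetical and not part of the tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(dir_path, output_csv):
    """Return the argument list for one sina_corpus_tokenizer invocation."""
    return [
        "sina_corpus_tokenizer",
        "--dir_path", str(Path(dir_path)),
        "--output_csv", str(Path(output_csv)),
    ]


def run_tokenizer(dir_path, output_csv):
    """Run the command, raising if it is missing or exits with an error."""
    # Assumption: the console script is installed and reachable on PATH.
    if shutil.which("sina_corpus_tokenizer") is None:
        raise FileNotFoundError("sina_corpus_tokenizer was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_command(dir_path, output_csv), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the directory path contains spaces.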
- The tool only processes text files (with a .txt extension).
- The output CSV will contain the following columns:
    - 'Row_ID': a unique identifier for each word token.
    - 'Sentence ID': the identifier for the sentence in which the word appears.
    - 'Docs_Sentence_Word_ID': a concatenated identifier comprising the directory name, file name, sentence ID, and word position.
    - 'Word Position': the position of the word within the sentence.
- Ensure that the text files are encoded in UTF-8 or a compatible format.
- The tool uses the nltk library for sentence and word tokenization. Make sure it is installed in your environment.
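The behaviour described above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's actual implementation. The function names are hypothetical. The regex-based splitters stand in for nltk's sent_tokenize and word_tokenize so that the example runs without extra downloads. The 'Word' column is an assumption added so that the token itself is visible, and the underscore-joined format of 'Docs_Sentence_Word_ID' is a guess.

```python
import csv
import re
from pathlib import Path

# Column names follow the list above; 'Word' is an assumed extra column.
COLUMNS = ["Row_ID", "Docs_Sentence_Word_ID", "Sentence ID", "Word Position", "Word"]


def split_sentences(text):
    # Stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    # Stand-in for nltk.word_tokenize: runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokenize_corpus(dir_path, output_csv):
    """Tokenize every .txt file under dir_path and write one CSV row per token."""
    root = Path(dir_path)
    row_id = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        # rglob searches subdirectories recursively; sorting keeps the output stable.
        for path in sorted(root.rglob("*.txt")):
            text = path.read_text(encoding="utf-8")
            for sent_id, sentence in enumerate(split_sentences(text), start=1):
                for position, word in enumerate(split_words(sentence), start=1):
                    row_id += 1
                    # Assumed format: directory name, file stem, sentence ID, word position.
                    doc_id = f"{path.parent.name}_{path.stem}_{sent_id}_{position}"
                    writer.writerow([row_id, doc_id, sent_id, position, word])
    # The last Row_ID equals the total number of tokens written.
    return row_id
```

Swapping the two stand-in functions for nltk.sent_tokenize and nltk.word_tokenize would bring the sketch closer to the described tool, provided nltk and its tokenizer data are installed.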